DOCUMENT RESUME

ED 212 087                                                  EA 014 365

AUTHOR         White, Karl; And Others
TITLE          State Refinements to the ESEA Title I Evaluation and
               Reporting System: Utah 1979-80 Project. Final Report.
INSTITUTION    Utah State Office of Education, Salt Lake City.
SPONS AGENCY   Department of Education, Washington, D.C.
PUB DATE       May 81
CONTRACT       UTP-301-1
NOTE           236p.
EDRS PRICE     MF01/PC10 Plus Postage.
DESCRIPTORS    *Achievement Tests; Elementary Education; Evaluation
               Methods; Norm Referenced Tests; Questionnaires;
               Scores; State Surveys; Tables (Data); Teacher
               Attitudes; *Test Format; *Testing; Test Theory; *Test
               Validity; *Test Wiseness
IDENTIFIERS    Elementary Secondary Education Act Title I; *Title I
               Evaluation and Reporting System; *Utah
ABSTRACT

To explain discrepancies in Utah's elementary school test results under the Elementary and Secondary Education Act's Title I Evaluation and Reporting System (TIERS), researchers investigated the adequacy and validity of TIERS evaluation models. Model A (norm-referenced testing) is used in most Utah school districts, in preference to Models B or C (both involving comparisons with control groups). The researchers reviewed previous research and conducted four projects that (1) compared test results under Models A and B, (2) assessed how well Utah's tests met Model A's assumptions, (3) analyzed the effects of test formats, and (4) examined the impact of training students and teachers in test taking and administering. Using tests, interviews, and observation, the projects analyzed test scores, educators' attitudes, and students' and teachers' test behaviors in several school districts, especially the Salt Lake City School District. The results indicate that (1) Model A overestimates Title I's impact, (2) most of Model A's assumptions are met, (3) test format heavily affects test results, and (4) training students and teachers to take and give tests improves scores. Ten appendices reproduce cover letters and data collection forms. (RW)



************************************************************************
*   Reproductions supplied by EDRS are the best that can be made       *
*                   from the original document.                        *
************************************************************************



FINAL REPORT



STATE REFINEMENTS TO THE ESEA, TITLE I 
EVALUATION AND REPORTING SYSTEM: 
UTAH 1979-80 PROJECT 






MAY 1981



Submitted by
UTAH STATE OFFICE OF EDUCATION
Salt Lake City, Utah



FINAL REPORT

State Refinements to the ESEA Title I Evaluation
and Reporting System: Utah 1979-80 Project¹



by

Karl White
Cie Taylor
Larry Carce
Nancy U El
Utah State University

in cooperation with

Utah State Office of Education
May, 1981



¹ This work was supported by the United States Department of Education under Contract #UTP-301-1. Opinions included herein do not necessarily reflect official Education Department policy.



TABLE OF CONTENTS

I. OVERVIEW OF THE STUDY
     Problem Statement
     Objectives
     Additional Studies

II. REVIEW OF RESEARCH ON ASSUMPTIONS UNDERLYING TIERS MODELS
     Model Descriptions
     Review of Literature Pertaining to Model Assumptions
          Assumptions Common to All Models
          Assumptions of Model A
          Assumptions of Model B
          Assumptions of Model C
          Comparability of Models
     Summary and Conclusions
          Validity of Statistical Assumptions
          Proper Implementation
          Model Comparability

III. A COMPARISON OF MODELS A AND B
     Procedures
     Results and Discussion

IV. DEGREE TO WHICH ASSUMPTIONS MADE BY MODEL A ARE MET IN UTAH TITLE I EVALUATIONS
     Procedures
     Results
          Interviews with LEA Staff
          Determination of On-task Behavior During Testing
          Quality of Test Administration
          Match Between Curriculum and Testing
     Discussion

V. THE EFFECT OF ITEM FORMAT ON STUDENTS' STANDARDIZED READING ACHIEVEMENT TEST SCORES
     Previous Research
          Order of Test Items
          Test Item Format
          Question and Answer Form
          Summary
     Method
          Test Construction
          Procedures
     Results
          Study I
          Study II
     Discussion

VI. EFFECTS ON STANDARDIZED ACHIEVEMENT TEST PERFORMANCE OF TRAINING TEACHERS, TRAINING STUDENTS, AND MOTIVATING STUDENTS
     Procedures
          Subjects
          Treatment
          Dependent Measures
          Observing and Reinforcing
     Results and Discussion
          Test Scores
          Teacher Behavior
          Student Behavior
          Test Booklets
     Summary

VII. SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS
     A Comparison of TIERS Model A and B
     Degree to Which Assumptions Made by Model A are Met in Utah Title I Evaluations
     The Effect of Item Format on Students' Scores from Standardized Achievement Testing
     Effects on Standardized Test Performance of Training Teachers, Training Students and Motivating Students
     Summary

REFERENCES

APPENDICES
     Appendix 1: Letters to District Title I Directors Explaining Purpose of Project
     Appendix 2: Letter to Principals of Schools Visited During Project
     Appendix 3: Memorandum Provided to Principals for Informing Teachers About the Project
     Appendix 4: Interview Guide Sheet for Collecting Data from LEA Personnel Regarding Implementation of TIERS
     Appendix 5: Data Collection Form and Definitions of On-Task/Off-Task Behavior for Classroom Observation
     Appendix 6: Quality of Test Administration Checklist
     Appendix 7: Interview Guide Sheet for Teachers to Prioritize Curriculum Areas
     Appendix 8: Computing Student/Teacher Ratios for Title I Programs
     Appendix 9: Letters of Approval, Support, & Notification Regarding Extended Work Scope Project
     Appendix 10: Approval Forms and Letters Related to Videotaping

CHAPTER I 



OVERVIEW OF THE STUDY 



During the 1979-80 school year, Title I services were provided to almost 20,000 students in approximately 240 schools throughout the 40 school districts in Utah. The impact of the Title I programs on students' achievement in these districts was measured in accordance with the Title I Evaluation and Reporting System (TIERS), which had been developed under contract to the U.S. Department of Education.

In implementing TIERS, each district in Utah chose one of three models (Model A - norm referenced, Model B - comparison group, or Model C - special regression)¹ to evaluate the impact of Title I programs on student achievement. Each of these models compares the achievement level of students at the conclusion of the Title I project with an estimate of the achievement level which would have resulted if the students had not participated in Title I. According to the developers of the evaluation models which underlie TIERS, each of the three models should yield comparable results if properly implemented (Tallmadge and Wood, 1976). Such comparability is essential if data from different programs are to be aggregated to determine the statewide or nationwide effect which Title I programs are having on student achievement.

Districts in Utah have been using TIERS to some degree as a part of their Title I evaluations since the 1977-78 school year. During that year, approximately 50% of the state's 40 districts used the USOE models to examine student achievement resulting from Title I programs. In 1978-79, 90% of the districts used one of the models, and during the 1979-80 school year, all of the districts in Utah used one of the evaluation models proposed by TIERS, although only 33% of the districts were required to report their results to the state. As is the case nationwide, most districts in Utah have selected to implement Model A (the norm referenced model).² During the 1979-80 school year, Model A was used in 36 of the 40 Utah school districts.

¹ These three models are not described in detail since they are very familiar to most readers. Additional explanation of the models can be found in Tallmadge and Wood (1976).

Problem Statement

An analysis of the initial results from Title I evaluations in Utah which have used Model A has raised a number of questions among Utah State Office of Education (SEA) personnel who are responsible for the statewide implementation, coordination, and evaluation of Title I within Utah.

First, during the past five to six years, SEA Title I personnel have systematically worked to identify which Title I projects are most effective. The identification of effective programs has depended on data from standardized achievement testing, on-site visits to the projects, and interaction with project staff. The experience of SEA personnel in conjunction with standardized test data has provided a fairly good indicator of which districts have historically operated the "best" Title I programs. The results from the Model A evaluations, however, sometimes contradict these historically based assessments of program quality. Some of the programs which have traditionally been the "best" have shown very limited or no gains, and some of the programs which have always been considered by SEA personnel to be the weakest have demonstrated the best gains. In many cases, this seems to be occurring even though the individual programs and associated personnel have not changed appreciably.

² Model A can be implemented using either a nationally normed standardized test (Model A1) or a Criterion Referenced Test (Model A2). Throughout the remainder of this paper, "Model A" refers to Model A1.

Secondly, unless there are dramatic changes in either the type of student being served or in the nature of Title I programs being provided to children, it would seem that the impact of Title I within a given district should not shift dramatically from year to year. However, in some districts where no such changes have occurred, the impact of Title I programs since Model A was implemented seems to have changed substantially and to have shifted back and forth from year to year.

In view of the fact that most districts in Utah have chosen to use Model A because of its greater feasibility, these initial results from the model caused SEA Title I personnel, as well as many local district personnel, to become concerned about the validity of Title I evaluation results obtained using Model A. This project was undertaken to provide further information about, and explanation of, the apparent discrepancies between Title I evaluation data obtained prior to and after the implementation of TIERS. The intent of the project was to allow both SEA and LEA Title I personnel to proceed with more confidence in making decisions regarding the impact of Title I projects.

In considering the reasons for the inconsistencies in evaluation results, a number of factors were identified which might be partially responsible for the apparent discrepancies observed with Model A. Each of the factors is discussed briefly below.

Inaccuracy of reported information. It is possible that local schools and districts are not reporting data accurately. The TIERS models have numerous areas in which arithmetic, procedural, and/or clerical errors could occur which would substantially alter the results for a given project. Stonehill and English (1979) reported that in studies conducted by the Department of Education and various Regional Technical Assistance Centers (TACs), more than 95% of the districts made some errors when producing their LEA Title I evaluation report.

This source of error (i.e., the accuracy of reported data) has been a source of concern to SEA personnel in Utah since districts began to implement the evaluation models. Efforts to reduce errors of this type constitute a major portion of the responsibilities of Utah SEA Title I personnel. For example, much of the hand calculation by LEA personnel has been eliminated through the development of a computer program at the SEA which uses raw data submitted by LEAs to do most of the necessary computation. Workshops and various standardized forms for reporting raw data have also been used to further reduce errors. In addition, much of the assistance provided to Utah by the Region VIII TAC has been targeted on this area. As a result of these past and ongoing activities, there was reasonable confidence among SEA Title I personnel that arithmetic, procedural, and/or coding errors would not be a major source of error in 1979-80 Title I evaluation data in Utah.

Violation of Model A assumptions. A second potential source of error in the results of Title I evaluation data obtained using Model A was that some of the assumptions made to ensure "proper implementation" of the model were being violated. Briefly summarized, the assumptions made by Model A include:

1. Publisher's directions for administering and scoring tests are adhered to closely.

2. Selection tests are separate from pretests in order to eliminate regression towards the mean.

3. Both pre- and posttests are administered close to empirical norm dates and appropriate adjustments are made where necessary.

4. An appropriate level of the test is used in order to avoid floor and/or ceiling effects.

5. Make-up tests are given within two weeks to students who were absent on testing day.

6. The norms of the tests being used for pre- and posttests are appropriate for the [remainder illegible].

7. Test content in the standardized tests appropriately matches the instructional emphasis of the LEA.

Violations of one or more of these assumptions by users of Model A may bias results either positively or negatively. The biasing effect of such violations has been discussed to some degree by other writers (Conklin, 1979; Kaskowitz and Norwood, 1977; Linn, 1978; Long, Schaffran, and Kellogg, 1977; Murray, 1979; Tallmadge and Horst, 1977). One purpose of this study was to explore the frequency and estimate the probable consequences of such violations in Utah school districts which are using Model A.

Validity of results when Model A is properly implemented. A third possible explanation for the observed discrepancies in the TIERS Model A results is that Model A may not provide an accurate measure of program impact. The theoretical adequacy of Model A has been questioned by a number of writers (Kaskowitz and Norwood, 1979; Linn, 1978), with particular concern being expressed about the equipercentile assumption, i.e., the assumption that if the Title I program is completely ineffective, students enrolled in the program will maintain their same percentile ranking with respect to the norm group from pretest to posttest.

With a major effort already being directed toward solving the problems created by inaccurately reported data, an examination of the frequency and probable consequences of violations of model assumptions, together with an examination of the validity of Model A, was intended to provide SEA and LEA personnel in Utah and other states with useful information for selecting an appropriate model and interpreting the results of Title I evaluations.



Objectives

The overall goal of this study was to investigate and draw conclusions about the adequacy and validity of Model A as it is currently being implemented in Utah school districts. This goal was addressed through the accomplishment of the following objectives.

OBJECTIVE #1: Compare the Estimates of Achievement Growth Due to a Title I Project Using Model A and Model B with the Same Group of Title I Students (grades 2-4)

Since Model B is generally regarded as the most rigorous of the three models, a properly implemented Model B should provide the best indicator of the true impact of a Title I program on students' achievement. A comparison of the results from the two models would yield valuable information about the validity of Model A results.

Previously completed studies had compared the results of two different models using actual data (Faddis and Arter, 1979; Davidoff, 1978; Kearns, 1978; Storlie, Rice, Harvey and Crane, 1979; Stennan and Raffield, 1977) or simulated data (Echternacht, 1978; Mandeville, 1978; Stennan and Raffield, 1977). Most of these studies have not included Model B as one of the models being compared. Faddis and Arter (1979) compared results of Model B and Model A and found that Model A resulted in lower estimates of growth than Model B. Although their results were related to the objectives of this study, Faddis and Arter (1979) only considered ninth grade students, did not explicate adequately how the control schools for Model B were selected, and had significant dropout rates (55%) between pre- and posttests.




OBJECTIVE #2: Analyze the Title I Evaluation Procedures and Results from Utah Districts which have Used Model A to Pinpoint the Frequency and Probable Consequences of Violating Model A Assumptions

This objective was addressed to provide empirical evidence about the frequency and extent to which the assumptions made by Model A were being violated in Utah school districts. Data about violations of Model A assumptions were intended to provide information for estimating whether the inconsistencies regarding Model A evaluation results could be due to improper implementation of the model. The assumptions considered in this analysis were referred to earlier.



Additional Studies

As the work scope defined in the contract with the Department of Education commenced, a number of other activities were identified which related to objectives of the original work scope. A number of these activities were undertaken and accomplished by project staff. Some of this work was partially funded by another small grant from the Office of Special Education (a student-initiated research grant conducted by Cie Taylor and Karl White), other parts were paid for by research funds from Utah State University, and some parts were made possible by more efficient use of project resources. This efficient use of resources was possible because project staff were already working with staff members in a number of schools in the Salt Lake District and many of the same procedures and instrumentation developed for this project could be used in the additional studies. In other instances, some of the additional work was initiated because, once the project had begun, members of the research team had new insights about what kinds of information would bear meaningfully on the problems which had originally motivated this study.




Consequently, the research reported in this final report is much broader than the work scope originally outlined in the proposal. All of the originally described activities were accomplished, and the report summarizes those activities and accomplishments. In addition, the report describes the outcomes and benefits accomplished as a result of the additional projects which were made possible because of the existence of the State Refinements contract.

These additional projects are discussed in Chapter V (The Effect of Item Format on Students' Standardized Reading Achievement Test Scores) and Chapter VI (The Effect on Standardized Test Performance of Training Students, Training Teachers, and Motivating Students). The following three chapters consist of Chapter II: A Review of Research on Assumptions Underlying TIERS; Chapter III: A Comparison of Models A and B; and Chapter IV: Degree to Which Assumptions Made by Model A are Met in Utah Title I Evaluations. Each of these chapters includes procedures, results, and conclusions. The final chapter of this report synthesizes conclusions from each of the preceding five chapters into a brief summary and recommendations concerning the implementation and operation of TIERS in Utah districts.






CHAPTER II

REVIEW OF RESEARCH ON ASSUMPTIONS
UNDERLYING TIERS MODELS

Title I of the Elementary and Secondary Education Act (ESEA) of 1965 was a massive action on the part of the Federal government to upgrade the educational system of the United States. Basically, funds are provided for educational programs aimed at improving basic educational skills of educationally disadvantaged children. These skills are defined as competencies in reading, math, and language arts. Financial aid is allocated to counties with high concentrations of poor families. However, Title I students are selected on the basis of educational need (Stonehill & English, 1979).

Since its inception, ESEA has mandated that schools receiving Title I funds evaluate their programs on a yearly basis. Throughout the early years of Title I, evaluations of a kind were performed by those receiving funds. However, by the early 1970s it was apparent that compilation of these data into a summary of educational achievement on the part of Title I students was impossible. In short, there was no way to measure the impact of a program which over an 8-year period had cost over $10 billion (Barnes & Ginsberg, 1978).

In 1974 Congress passed Section 151 (now Section 183) of Title I, which required the development of standards for evaluation which would yield comparable data across programs. Technical Assistance Centers (TACs) were also established to assist State and Local Educational Agencies in the implementation of such evaluation (Stonehill & English, 1979). Shortly thereafter the Office of Education contracted with RMC Research Corporation under competitive bidding procedures to develop the Title I Evaluation and Reporting System (TIERS).






RMC developed three basic models of evaluation. Each includes pre- and posttesting, a method for generating gain estimates in the absence of a Title I program, and procedures for converting test results into "Normal Curve Equivalents" (NCEs) to permit aggregation of data across projects. These models are designed to answer the question, "How much more did pupils learn by participating in the Title I project than they would have learned without it?" (Tallmadge & Wood, 1976). However, since TIERS' inception, increasing numbers of evaluators have questioned whether or not the system works.

From the beginning TIERS has elicited heavy criticism as well as support. There is no consensus on its use, applicability, or validity. There appear to be three types of concerns with the system. The first is whether the statistical assumptions of each model are valid. The second is whether proper implementation of a given model can occur at the local level. The third is whether the models are comparable. Currently, there is no integration or analysis of the literature pertaining to these concerns. Since Title I programs cost billions of tax dollars and affect millions of children, valid evaluation is essential. This critical review will examine the TIERS models by reviewing the literature which assesses TIERS. The results of the review should be useful in determining whether the system is valid as an evaluation system.

MODEL DESCRIPTIONS

All three evaluation models use the same definition of treatment effect. Specifically, the project's impact is the actual posttreatment performance minus the expected no-treatment performance. The models differ in how the no-treatment expectation is generated, but all models use pre- and posttesting to determine treatment effect.
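The shared definition can be written compactly. The notation below is introduced here for illustration only and is not taken from the report: $\bar{Y}_{\text{obs}}$ stands for the treatment group's observed posttreatment performance, and $\hat{Y}_{\text{exp}}$ for the estimate of what that performance would have been without Title I services.

```latex
\[
\text{treatment effect} \;=\; \bar{Y}_{\text{obs}} \;-\; \hat{Y}_{\text{exp}}
\]
```

Under this reading, the three models share the first term and differ only in how $\hat{Y}_{\text{exp}}$ is estimated: from publisher's norms (Model A), from a control group (Model B), or from a comparison group's regression line (Model C).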



Model A (the norm referenced model) uses test publisher's norms to generate the no-treatment expectation on the assumption that the group's ranking on the pretest would be maintained on the posttest if there were no treatment. Model B (the control group model) uses a control group which theoretically receives an education identical to that of the treatment group except that its members are not Title I project participants. In Model C (the special regression model), the no-treatment expectation is derived by finding the treatment group's mean pretest score on the comparison group's post-on-pretest regression line. Each model can be used with either normed (type 1) or non-normed (type 2) tests. When non-normed tests are used, a normed test must also be administered in order to convert measured gain into NCE units.

The following sections give an overview of each model and the associated procedures.

Model A. When nationally normed tests are used, the no-treatment expectation is the percentile status of the treatment group at the time of the pretest. This model assumes that if there is a gain due to treatment, the group will rise in percentile status. If there is no treatment effect, the percentile status will remain the same. The observed posttreatment performance is the percentile status of the group's mean posttest score, while the expected no-treatment performance is the pretest percentile.

When non-normed tests are used, a nationally normed test must be given at pretest time. Score equivalencies between the non-normed pretest and the normed test are determined. The median scores on the non-normed posttest are converted to normed test counterparts on the basis of these data, and this figure is converted to a percentile and becomes the observed posttreatment performance. Again, the pretest percentile is the expected no-treatment performance.
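The Model A comparison with a normed test can be sketched in a few lines. This is an illustration under stated assumptions, not the TIERS procedure itself: it assumes the group's pretest and posttest percentile ranks have already been read from the publisher's norm tables, the function names are hypothetical, and the percentile-to-NCE step uses the conventional NCE scaling (mean 50, standard deviation 21.06), which this section does not itself state.

```python
from statistics import NormalDist

# Conventional NCE scaling (an assumption here, not quoted from the report).
NCE_MEAN = 50.0
NCE_SD = 21.06


def percentile_to_nce(percentile: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile / 100.0)  # normal deviate
    return NCE_MEAN + NCE_SD * z


def model_a_gain(pretest_percentile: float, posttest_percentile: float) -> float:
    """Hypothetical helper: observed minus expected performance in NCE units.

    Under Model A the expected no-treatment performance is simply the
    group's pretest percentile status, carried forward unchanged.
    """
    expected = percentile_to_nce(pretest_percentile)
    observed = percentile_to_nce(posttest_percentile)
    return observed - expected


if __name__ == "__main__":
    # Illustrative percentiles only; not data from the Utah study.
    print(round(model_a_gain(20.0, 27.0), 2))
```

A group that holds its percentile rank from pretest to posttest shows a gain of zero, which is the equipercentile assumption expressed in code.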

Since all calculations are performed on the basis of the publisher's norms, several important factors must be observed. First, all testing (pre and post) must be done on dates within two weeks of the empirical norming date for the test, or within six weeks if publisher's norms are adjusted. Secondly, the group's composition must be similar to that of the population used in the norming sample. Third, testing must be performed exactly as done for the norming sample. Fourth, to avoid statistical regression towards the mean, the pretest may not be used to select students for the program.

Model A is the most frequently used model. It is the easiest and cheapest to implement. In both Model B and C, the calculations are more complex and the cost is much higher, since control groups must also be tested.

Model B. With this model, a control group is formed. The control group's posttest percentile (Model B1) or mean raw score (Model B2) is the expected no-treatment performance, whereas the treatment group's posttest percentile or mean raw score is the observed posttreatment performance.

The pretest scores verify the groups' equivalency. If these scores differ between groups, two statistical techniques can be used to adjust for this inequality. Where random assignment of a population into groups is used, analysis of covariance is used for adjustments. When the groups are more appropriately regarded as samples from two different populations, the principal axis method of adjustment is used.
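The covariance adjustment mentioned above can be sketched as follows. This is the textbook analysis-of-covariance adjustment using a pooled within-group slope, offered as an illustration rather than as the TIERS procedure specified by its developers; the function name and sample scores are hypothetical, and the principal axis method is not shown.

```python
from statistics import mean


def ancova_adjusted_effect(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Posttest difference between groups, adjusted for pretest differences.

    Hypothetical helper. The adjustment subtracts the pooled within-group
    slope times the difference in pretest means from the raw posttest gap.
    """

    def cross_products(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return sxy, sxx

    sxy_t, sxx_t = cross_products(treat_pre, treat_post)
    sxy_c, sxx_c = cross_products(ctrl_pre, ctrl_post)
    pooled_slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)

    raw_difference = mean(treat_post) - mean(ctrl_post)
    pretest_gap = mean(treat_pre) - mean(ctrl_pre)
    return raw_difference - pooled_slope * pretest_gap


if __name__ == "__main__":
    # Illustrative scores only; not data from the Utah study.
    effect = ancova_adjusted_effect(
        treat_pre=[30, 40, 50], treat_post=[45, 55, 65],
        ctrl_pre=[40, 50, 60], ctrl_post=[50, 60, 70],
    )
    print(effect)
```

In the illustrative data the treatment group scores lower on the posttest than the control group, yet the adjusted effect is positive because the treatment group also started lower. That is the situation the adjustment is meant to handle.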

Again, several factors must be observed if the model is to be properly implemented. First, both groups must be tested at the same time, in the same manner, with the same test and level of test. Secondly, group composition must be similar in terms of socioeconomic status (SES), sex, and race. Even though small systematic differences can be adjusted for, large differences may invalidate the model. Third, with the exception of the Title I treatment, the educational experiences of both groups must be the same.

Model B is difficult to implement. Appropriate control groups are usually not available, since it is usually impossible to randomly assign children to treatment and control groups. Moreover, very few LEAs have large groups of eligible children who are not receiving Title I services. Model C takes this factor into account.

Model C. With this model, participation in the program is based on a strict cutoff score on the pretest. All those above the cutoff form the comparison group and all those at or below form the treatment group. Post-on-pretest regression lines are calculated separately for each group. The treatment group's line represents the observed mean posttest performance corresponding to various pretest scores, i.e., the observed posttreatment performance. By projecting the comparison group's line below the pretest cutoff score, the expected no-treatment performance is obtained. The actual treatment effect is the difference between the lines measured at two points: the treatment group's mean pretest score, and the cutoff score.

This model requires that the pretest-posttest relationship be linear. The two measures must be highly correlated, and no floor or ceiling test effects may be present which could create curvilinearity. Furthermore, the pretest must be used as the sole basis of selection.
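The Model C steps described above can be sketched as follows. This is a minimal illustration using ordinary least-squares lines, not the developers' implementation; the function names and the sample scores are hypothetical, and no check is made for the linearity or correlation requirements the model imposes.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares line through (x, y); returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def model_c_effects(pre, post, cutoff):
    """Hypothetical helper: gap between the two regression lines.

    Students at or below the cutoff form the treatment group; those above
    form the comparison group. The comparison line is projected below the
    cutoff to give the expected no-treatment performance. Returns the gap
    at the treatment group's mean pretest score and at the cutoff score.
    """
    treat = [(x, y) for x, y in zip(pre, post) if x <= cutoff]
    comp = [(x, y) for x, y in zip(pre, post) if x > cutoff]

    t_slope, t_int = fit_line([x for x, _ in treat], [y for _, y in treat])
    c_slope, c_int = fit_line([x for x, _ in comp], [y for _, y in comp])

    def gap(x):
        observed = t_int + t_slope * x
        expected = c_int + c_slope * x
        return observed - expected

    treat_mean_pre = mean([x for x, _ in treat])
    return gap(treat_mean_pre), gap(cutoff)


if __name__ == "__main__":
    # Illustrative scores only; not data from the Utah study.
    pre = [20, 30, 40, 50, 60, 70]
    post = [32, 42, 52, 55, 65, 75]
    print(model_c_effects(pre, post, cutoff=40))
```

When the two lines are parallel, as in the illustrative data, the effect is the same at both points; when the slopes differ, the two estimates diverge, which is why the model reports both.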

REVIEW OF THE LITERATURE PERTAINING TO
MODEL ASSUMPTIONS

For each model, a set of assumptions exists. These assumptions can be classified into two categories: those held in common by all three models and those specific to a given model. Although a large body of literature has examined the assumptions made by TIERS and discussed the various problems and biases introduced upon violation of a given assumption, this literature has not been effectively summarized. The following section will detail the assumptions and examine the pertinent literature regarding assumption violations.



A. ASSUMPTIONS COMMON TO ALL MODELS



Quality control regarding test administration and data reporting is probably the single largest assumption affecting implementation of TIERS. Without quality control, all of the other assumptions are irrelevant. Quality control is defined as the accurate collection and reporting of all data necessary to implement any TIERS model. Therefore, it enters into the system at numerous places and can take many forms. The following discusses the various places where quality control affects the system.



Selection of Tests



Several decisions must be made when a test is being selected to evaluate a Title I program. First, does the test measure the curriculum being taught? Secondly, should a normed or non-normed test be used? Third, which level of the test should be administered? Finally, when should evaluation of the program occur?

Curriculum-Test Matching

On the surface, it would appear as though the objectives of Title I (to
teach basic reading and math skills to educationally disadvantaged
children) would allow for the use of the same instrumentation nationally.
This would permit exact comparability of results, the area where many of
the problems of TIERS are encountered. However, as Tallmadge and Horst
(1978) have pointed out, the basic skills often require several years to
acquire (e.g., reading comprehension), and an evaluation system must
either allow for the complete acquisition or confine evaluation to what is
being taught during the specific period under evaluation.

Instructional programs tend to be focused on some subset of the basic
skills. Thus, if testing occurred over the entire collection of subsets,
gains in one very specific area might not be detected. For example, suppose
District A's reading program focuses on vocabulary and uses a test which
emphasizes word attack skills; gains in vocabulary might go undetected.
However, if a different achievement test were used (one which emphasizes
vocabulary), gains might be observed. Using a test specific to the
curriculum, appropriate evaluation can occur, since the more closely a
test corresponds with the skill being taught, the greater the likelihood
that student gains can be detected (Fagan & Horst, 1978).

While fitting a test to a curriculum may sound easy, specific analysis
of, for example, a reading achievement test into its component subsets
is seldom done. The importance of such an analysis was demonstrated by
Porter, Schmidt, Floden, and Freeman (1978) who classified the items of
the mathematics subtests of four standardized achievement tests (SAT,
ITBS, MAT, and CTBS) into several factors including the nature of the
material. Their analysis revealed major differences between the tests.

The major objection against the use of the same instrument nationally
is that it would not allow Local Education Agencies (LEAs) or State
Education Agencies (SEAs) sufficient control over their curricula. LEAs
would be tempted to design curriculum to fit into the evaluation system
rather than teaching to students' needs. If the LEA focused
instruction for some students in areas not included on the national
test, gains in these areas would go unmeasured. Furthermore, for the vast
majority of educators, anything which would reinforce the concept of a
nationally mandated education curriculum is distasteful and threatens
educational and curriculum progression.
The aspect of curriculum-test match is vital to all the models. If
test selection is done without considering the curriculum it is to
evaluate, serious threats to internal validity occur. Tests must not be
chosen because the LEA possesses a copy, or because it is inexpensive, or
because the administrator likes a particular test, but because the test
fits the curriculum.

Normed vs. Non-normed Tests

The second problem of test selection is deciding what type of test to
use: normed or non-normed. All models have provisions for the use of
non-normed tests provided that a normed test is given either at the time
of the pretest (Model A2) or at some time during the period of evaluation
(Models B2 and C2).



The use of a non-normed test creates several problems for the
evaluator. First, there are additional costs since a normed test must
also be given to establish gain estimates in terms of NCEs. These costs
would be least in Model A, since only the Title I population is tested,
and greater in Models B and C, where both treatment and control/comparison
groups also have to be tested. The second problem is one of within model
comparability, e.g., will Model A implemented with normed tests (Model
A1) yield the same estimates of gain as if implemented with non-normed
tests (Model A2)? Third, non-normed tests may be overly restrictive and
measure a very small subset of skills which do not reflect actual gains
made in the total area of basic reading, mathematics and/or language arts
skills. However, some authors (Tallmadge & Horst, 1977; Arter & Estes,
1978) suggest that the benefits of using a non-normed test may outweigh
the difficulties. These benefits include greater curricula sensitivity
and greater gain sensitivity.

A non-normed test is usually a criterion-referenced test (CRT). A
criterion-referenced test is particularly appropriate when gains are to be
measured in isolated skill areas, where a norm-referenced test (NRT) may
be insensitive to the specificity of the curricula. Additionally,
norm-referenced tests may be insensitive to the small gains made in a
given project even though these gains could be educationally important
(Tallmadge & Horst, 1977). Finally, in the case of Model A, local norms
may differ from national norms. If the local population is below average,
a year's worth of gain for that population might be less than a year's
gain in the norming population. Those classified as educationally
disadvantaged within the local population would be at an even greater
disadvantage when their gains were compared to the national norms (Arter &
Estes, 1978).

One aspect of the above was investigated by Long, Horwitz, & DeVito
(1978). While the study was specific to Model B, their conclusions are
applicable to all models. National group standard deviations were
estimated on a local criterion-referenced test using the local group's
standard deviations on the normed and non-normed tests. This assumed that
the ratio of local to national standard deviations was identical for both
tests. Their results demonstrate a higher estimated standard deviation
for the national norm group than the actual national norm group standard
deviation reported by the test publisher. The authors noted that it is
uncertain whether the error of estimation is systematic or random, and
that until this issue is clarified, the use of non-normed tests to
evaluate Title I projects is questionable. They suggest a system for
establishing confidence intervals around the estimated standard deviation.
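
The equal-ratio assumption can be written out as a short calculation. The
sketch below is an illustration of that assumption only; the function
name and the standard deviations are hypothetical and are not taken from
Long et al. or from the User's Guide.

```python
def estimate_national_sd(local_sd_crt, local_sd_nrt, national_sd_nrt):
    """Estimate the national SD on a non-normed test (CRT).

    Assumes local/national SD ratios are the same on both tests:
        local_sd_crt / national_sd_crt == local_sd_nrt / national_sd_nrt
    """
    local_to_national = local_sd_nrt / national_sd_nrt
    return local_sd_crt / local_to_national


# Hypothetical values: the local group is less variable than the norm group.
estimated_sd = estimate_national_sd(
    local_sd_crt=8.0, local_sd_nrt=16.0, national_sd_nrt=20.0
)
print(estimated_sd)
```

If the ratio on the non-normed test in fact differs from the ratio on the
normed test, the estimate is biased in a direction that cannot be
determined from local data alone, which is the uncertainty the authors
raise.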

Comparability Within Models

Fishbein (1978) pointed out that if the CRT scores are not normally
distributed, the linear transformation of these scores into NCEs will not
result in a normal distribution. Since the resulting NCE distribution
will be a non-normal curve, and the NCE distribution for the normed test
is a normal curve, the conclusions based on these two distributions may be
different. Even if the same general model (e.g., Model B) was applied to
the same population tested on normed (B1) and non-normed (B2) tests,
different results might be obtained. Studies by Fish (1979) and Storlie,
Rice, Harvey, and Crane (1979) have attempted to make direct within model
comparisons.

Fish compared Model A1 to A2 and found that both yielded similar
estimates of gain as long as the two tests measured comparable skills.
Unfortunately, Fish did not include enough information to critically
evaluate the methodology of the study or the comparability of the tests.
The User's Guide specifically states that unless a high correlation (r of
.6 or higher) exists, it is inappropriate to use the non-normed test, since
NCEs are yielded by extrapolating through the normed test. This correlation
was a problem in the study by Storlie et al. (1979) which attempted to
compare A1 and A2. In this case, the authors decided that a correlation of
.56 was too low to justify equating the tests. The authors recommend a
simulation study due to difficulties encountered using empirical data.
This raises an additional problem with using a non-normed test. The
User's Guide presents no information on how to salvage an evaluation
which, after all of the data have been collected, has a correlation lower
than .6 between the NRT and CRT. This means that it is not implausible
that an LEA would commence their TIERS and discover too late to change
directions that the correlation between the normed and unnormed test was
too low.
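
Because the correlation can only be known once both tests have been
given, an LEA might run the check on pilot data before committing to a
non-normed test. The following is a hedged sketch of such a check; the
function names and scores are hypothetical, and treating a correlation of
exactly .6 as acceptable is an assumption made here.

```python
import math

MIN_CORRELATION = 0.6  # threshold cited from the User's Guide


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def can_equate(nrt_scores, crt_scores, min_r=MIN_CORRELATION):
    """True when the NRT-CRT correlation meets the assumed threshold."""
    return pearson_r(nrt_scores, crt_scores) >= min_r


# Hypothetical pilot scores for five students.
nrt = [30, 35, 40, 45, 50]
crt_close = [12, 15, 14, 18, 20]   # tracks the NRT fairly well
crt_loose = [15, 12, 20, 14, 18]   # tracks the NRT poorly

print(can_equate(nrt, crt_close), can_equate(nrt, crt_loose))
```

A sample of five is far too small for a real decision; it is used here
only to keep the arithmetic visible.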

Content tested by CRTs. The final problem, the narrowness of content
tested by most CRTs, is discussed by Fishbein (1978). This may also be
viewed as a curriculum-test match problem. If objectives of a program are
narrowly defined and large gains are demonstrated using CRT
instrumentation, one would conclude that the Title I pupils had shown more
growth than they would have without Title I. However, in terms of treatment
effect, it may be difficult to determine whether or not the overall
objectives of Title I were met. For example, even though the evaluation
may demonstrate that large gains have been made in phonics, the students
may not be better in general reading skills than they would have been
without the program; phonics skills may have been improved at the expense
of more general reading skills (Tallmadge & Horst, 1977). The problems
created by the narrowness of CRT content can be reduced to a problem of
policy vs. program objectives (Fishbein, 1978), where a policy objective
(e.g., the production of better readers) is more difficult to evaluate
than a program objective (e.g., the acquisition of phonics skills).

In summary, the use of non-normed tests appears to be unsupported by
the literature. Despite attempts by several authors (Tallmadge & Horst,
1977; Arter & Estes, 1978) to point out existing benefits, previous
research has not demonstrated that the benefits outweigh the difficulties.
The major difficulties, in addition to increased costs, lie in three
areas: lack of within model comparability, lack of a normal distribution
of CRT scores, and the narrowness of curriculum tested by most CRTs.



Out-of-Level Testing

The question concerning which level of a given test is most
appropriate has generated some discussion in the literature (e.g., Roberts,
1978; Ozenne, 1978; Johnson & Thomas, 1979). The correct level of a test
is one on which the fewest children score at either chance level or at the
top score (Johnson & Thomas, 1979). Testing "out of level", in the case
of Title I projects, means testing at the functional level of the child.
Testing a Title I child enrolled in the fifth grade at her functional
level might mean using the third grade level of the test.

Use of the incorrect level of a test can affect evaluations in two
ways. If a preponderance of children score at chance level on the
pretest, the pretest average is artificially inflated since the children
would have scored lower had there not been a floor effect. Upon
posttesting, the observed gain would be smaller than was actually the
case. If a large number of students were to score at the top of the
posttest, a similar effect is seen: the visible gain is less than the
actual gain since the students' posttest level is actually higher than the
test was able to measure. Since the test publisher's "recommended" level
is not necessarily appropriate for students with very low performance
levels, as would be the case in Title I classes, out-of-level testing
(i.e., the use of a test level other than specifically recommended by the
publisher) is often employed (Johnson & Thomas, 1979).
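
The way a floor shrinks the observed gain can be shown with a small
numerical sketch. Everything in it is hypothetical: the scores, the floor
and ceiling values, and above all the simplification that a floor acts by
clipping scores at the chance level (real guessing behavior is noisier).

```python
def observed_score(true_score, floor, ceiling):
    """Score the test can actually register, clipped to its usable range."""
    return max(floor, min(ceiling, true_score))


def mean(values):
    return sum(values) / len(values)


FLOOR, CEILING = 12, 40          # hypothetical chance level and top score

true_pre = [5, 10, 15, 20]       # what the students could really do
true_post = [15, 20, 25, 30]     # every student truly gains 10 points

obs_pre = [observed_score(s, FLOOR, CEILING) for s in true_pre]
obs_post = [observed_score(s, FLOOR, CEILING) for s in true_post]

true_gain = mean(true_post) - mean(true_pre)
observed_gain = mean(obs_post) - mean(obs_pre)
print(true_gain, observed_gain)
```

Two of the four pretest scores are pushed up to the floor, so the pretest
mean is inflated and the observed gain understates the true gain. Lowering
CEILING below the highest posttest scores produces the mirror-image loss.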

Out-of-level testing most seriously affects Model A, for which there
is no control group, but Models B and C can also be affected. For
example, if the treatment group in Model B "tops out" on the posttest,
comparison to a control group in which there is no ceiling effect will not
show true gains. With Model C, a change in level between pre- and
posttesting could affect the regression lines, since even minor floor or
ceiling effects on either test will result in curvilinear regression lines
making interpretation difficult (Estes & Anderson, 1978; this aspect will
be discussed more fully under Model C assumptions).

RMC Research Corporation realized the difficulties caused by floor
and/or ceiling effects in out-of-level testing. Roberts (1978) suggested
a formula for estimating the occurrence of floor and ceiling effects.
Specific suggestions were given for predicting when these effects would
occur as well as detecting the presence of the effects from score
distributions. For example, if student score distributions on a given
pretest predict that ceiling effects will occur on the posttest, Roberts
suggested that a higher level posttest be given, even though (1) this
violates a recommendation on the use of the same level pre- and posttest
for Model A, and (2) there may be content changes between the levels.

Ironically, Roberts' (1978) recommendations may make the
implementation and interpretation of TIERS more rather than less
difficult. Ozenne (1978) demonstrated that two levels of the CAT, both
recommended for fourth graders, yielded different results (31st percentile
vs 25th percentile on the average) when given to a group of fourth graders
split into random halves. Depending on the test level given, different
results would be obtained and different placement might occur.
Additionally, while a student's status on one level was fairly constant
across time, status changed if tested on the other level. Ozenne (1978)
recommended that evaluators avoid changing test levels from pre- to
posttest. If change is unavoidable, Ozenne (1978) recommended double
testing at posttest to adjust for any disparity between the two test
levels.






Several authors have attempted to compare students' gains by using
both at-level and out-of-level tests (Slaughter & Gallas, 1978; Ozenne,
1978; Long, Schaffran, & Kellogg, 1977; Crowder & Gallas, 1978; Powers &
Gallas, 1978). The results of these studies are equivocal, not only
across studies but within studies.

Long, Schaffran, and Kellogg (1977) attempted to determine whether
elementary school Title I students would receive the same grade equivalent
(G.E.) scores when tested in level and out of level. The results showed
that at the second and third grade levels, in-level tests resulted in
lower G.E. scores and more students eligible for Title I than if
out-of-level tests were used. At the fourth grade level, the opposite was
true. For example, on vocabulary, 77% of second graders and 88% of third
graders would have been eligible for Title I on the basis of in-level
testing, versus 54% and 66% (respectively) if selected by out-of-level
testing. For students in fourth grade, however, eligibility on the basis
of in-level testing was 77%, while 89% were eligible based on out-of-level
testing. For all grades, out-of-level testing demonstrated greater gains
than in-level testing.

Murray, Arter, and Faddis (1979), in commenting on Long et al.
(1977), stated that comparable results would only be valid if both levels
were appropriate, which was not the case. Additionally, they stated that
the probable reason for differing G.E. scores was that students tended to
score at chance level on in-level tests, but not on out-of-level tests
which were suited to their functional level. While this may account for
the results of grades 2 and 3, it does not account for the results of
grade 4. Perhaps a better explanation lies in the material being tested
and its comparability to the students' curricula. For example, if second
and third grade Title I students were being taught at functional level,
they would have performed better on the out-of-level test, while if the
fourth grade Title I students' curriculum was matched better with the
at-level curriculum, they would have performed better on the at-level
test.

The problem of curriculum-test matching is also apparent in the
Crowder and Gallas (1978) study which investigated whether standard scale
scores are comparable for in-level and out-of-level tests, and whether the
floor effect of in-level testing was evident in out-of-level testing. The
authors suggested that for students scoring at floor level on a higher
test level, the lower level test would give lower average standard scores;
for students in the functional range of each test, the standard scores
would be the same; and for students at the ceiling on the higher test,
standard scores would be at the ceiling on the lower test. However, these
patterns are not evident in their results. While the authors explained
that these results were due to the relatively large increase in standard
scores at the extremes of the scale transformation, Murray et al. (1979)
suggested that the differences were due to differences in curriculum
between the test levels used.

Powers and Gallas (1978) examined students in fourth, seventh, and
ninth grades tested in level and out of level. While fourth graders
attained higher percentile rankings on the in-level test (in-level 27%,
out-of-level 16%), seventh and ninth graders attained higher rankings on
the out-of-level test (in-level 16th and 23rd percentiles respectively,
out-of-level 13th and 16th percentiles respectively). However, no
significant differences were found in expanded standard scores. It can be
concluded that out-of-level tests may be more precise, even though both
tests provide similar information on student status.



One particular problem which has not been addressed in the
literature is the use of out-of-level tests with Model A, the
norm-referenced model. Since publishers' norms on a given test are for a
particular population (e.g., third graders), using that test on a
different population (e.g., fifth graders) creates difficulties in norm
referencing. If the in-level (e.g., fifth grade) test was used, a floor
effect might be seen, so the evaluator is caught in a double bind. Yet,
nearly all of the existing literature recommends testing at functional
level, i.e., out-of-level testing. Certainly, this aspect of out-of-level
testing requires further study.

In summary, there does not appear to be a strong data base supporting
out-of-level testing for Model A, although much of the literature condones
its use. The suggestion on the part of RMC Research Corporation that CRTs
be used with Model A is particularly alarming since norms are applied to a
different population than that of the norming sample. More research must
be done in this area. With Models B and C, it is clear that major
problems can occur if the test level is changed from pre- to posttest, as
the content levels may differ. If a different level of test is
unavoidable, both levels should be given at the posttest in order to
adjust for disparity between the levels. Obviously, since Models B and C
have local control/comparison groups, the problem of out-of-level testing
is not as great as with Model A, provided that both treatment and control
groups receive the same level of the test and that the test level does not
change from pre- to posttest. However, it is critical to avoid floor and
ceiling effects in either group.



Time of Testing

The last issue of test selection is the date of testing. One aspect
of this will be discussed under the assumptions of Model A, which requires
that testing be done on the same dates as the publisher's norms. The
other aspect, that of the length of the evaluation period, applies to all
models. The question is whether pre- and posttesting should be done on a
yearly basis (i.e., test once a year so that last year's posttest is this
year's pretest), or on the basis of the academic year (i.e., pretest in
the fall, posttest in the spring). An important question is whether loss
or gain occurs over the summer vacation and whether Title I students lose
more or gain less over the summer than non-Title I students.

One difficulty in comparing spring-spring vs. fall-spring testing is
that most publishers' norms tend to be based on spring-spring testing
(Conklin, 1979). This means that to obtain norms for fall performance,
empirical norms must be interpolated (unless empirical norms exist for
both fall and spring, which is the case for some tests). Conklin (1979)
and DeVito and Long (1977) criticize such interpolation since it usually
involves the equating of different test forms or levels and assumes that
different grade levels have the same pattern and rate of growth.
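
One common form of interpolation is a straight line between two
consecutive spring norms. The sketch below illustrates that form only; it
is not a method prescribed by the User's Guide or by any publisher, and
the function name, scale scores, and fraction of the year are
hypothetical.

```python
def interpolate_norm(spring_lower_grade, spring_upper_grade, fraction_elapsed):
    """Linearly interpolated norm between two consecutive spring norms.

    fraction_elapsed is the share of the spring-to-spring interval that has
    passed at the testing date (0.0 = earlier spring, 1.0 = later spring).
    """
    growth = spring_upper_grade - spring_lower_grade
    return spring_lower_grade + growth * fraction_elapsed


# Hypothetical median scale scores at spring of grade 2 and spring of grade 3.
spring_grade_2 = 300.0
spring_grade_3 = 340.0

# Hypothetical fall testing date halfway through the interval.
fall_grade_3_norm = interpolate_norm(spring_grade_2, spring_grade_3, 0.5)
print(fall_grade_3_norm)
```

A straight line credits students with steady growth through the summer.
If scores actually stay flat or fall over the vacation, the interpolated
fall norm is too high, pretest standing looks worse than it is, and
fall-spring gains are overstated.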

The User's Guide does not give instructions as to how interpolation
is to be performed. Inappropriate interpolation may have been a problem
in at least one study, as Murray et al. (1979) have commented that
incorrectly estimated norms are an alternative explanation for a South
Carolina study in which actual gains were greater in fall-spring testing
than in spring-spring testing (ESEA Title I Annual Evaluation Report: FY
1975, Office of Federal Programs, South Carolina Department of Education,
Columbia, S.C., November, 1975).






DeVito and Long compared the effects of spring-spring vs fall-spring
testing of educationally disadvantaged students on evaluation results.
They found significant declines in percentile rank between spring and fall
testing in most elementary grades (e.g., second graders: spring 33rd
percentile, fall 12th percentile; fourth graders: 20th vs 11th
percentiles). Unfortunately, DeVito and Long did not hold test form
and level constant in all cases. Differences in content between levels
could easily account for these results.

Faddis and Estes (1978) compared treatment effects for fall-fall and
fall-spring evaluation cycles using Model C. No treatment effects were
found for either cycle nor were there differences between results
depending on the cycle. However, the authors cautioned that these results
are only suggestive since a high attrition rate was apparent, interpolated
norms were used for fall norms, and summer losses seen in other studies
were not apparent.

In addition to the question of summer vacation declines is the issue
of whether evaluation should concern itself with only the school year or
with long-term effectiveness. David and Pelavin (1978) maintain that a
program should demonstrate sustaining effects over the summer. They
demonstrated that summer losses can be significant and recommended that
evaluation be done on a fall-fall basis so that program effects over the
summer can be ensured. However, mere evaluation on a yearly basis does
not ensure long-lasting program effects and, as evinced by Faddis and
Estes (1978), high attrition may preclude evaluating on a yearly basis.

In summary, different evaluation periods may produce different
estimates of treatment gain. This is due to two factors: use of
interpolated fall norms and possible differential summer loss or gain
between treatment and control groups. It appears as though summer losses
can be significant and linearly interpolated fall norms may not reflect
this loss.

Two recommendations can be made. The User's Guide should specify the
norm interpolation method, since different methods yield different
results. Secondly, as suggested by Murray et al. (1979), fall-fall,
spring-spring, and fall-spring test data should be separated.
Differential summer growth effects would then be easier to detect.

Test Administration, Scoring, and Analysis 

Specified procedures are supposed to be followed in administering
all standardized tests. These procedures may include a wide variety of
features such as specific wording of various items, basal and ceiling
level detection, timing, and the use of practice items. Testing
conditions (e.g., quiet, well-lit room) are usually specified. Often, it
is recommended that the test should be given only by those individuals
trained in its administration. Deviation from the instructions can
seriously alter students' test scores. This is particularly a problem
with Model A, since scores must be comparable to the norm data (Horst &
Wood, 1978). For Models B and C, testing conditions, including setting
and time of testing, procedures, and administration must be comparable
between treatment and control groups (Tallmadge & Roberts, 1978).

Following test administration, the dilemma of scoring the tests
occurs. Usually, scoring is not done by hand but by a computerized
scoring service. Following raw score determination, gains must be
converted into NCEs. The conversion for each score requires several
steps, usually including conversion to the publisher's standard score,
then to the national percentile ranking, and finally to an NCE. If a
non-normed test was used, additional calculations are necessary. At each
conversion step, errors in calculation and transcription may occur. While
the scores and conversions performed by scoring services are more accurate
than manually tallied and converted scores, the cost of such services can
be substantial.
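
Only the last step of that chain, percentile to NCE, can be sketched
without a publisher's tables. The example below relies on the usual
definition of the NCE as a normalized scale with a mean of 50 and a
standard deviation of 21.06, which is not stated in this section; the
function name is hypothetical, and in practice evaluators read NCEs from
printed conversion tables rather than computing them.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # chosen so that NCEs of 1, 50, and 99 match those percentiles


def percentile_to_nce(percentile):
    """Convert a national percentile rank (between 0 and 100, exclusive)
    to a Normal Curve Equivalent."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile / 100)  # normal deviate for the rank
    return NCE_MEAN + NCE_SD * z


for rank in (1, 25, 50, 75, 99):
    print(rank, round(percentile_to_nce(rank), 1))
```

Because the scale is stretched at the extremes, a move of a few
percentile points near the bottom of the distribution corresponds to a
larger NCE change than the same move near the median, so a transcription
error in a low percentile is magnified in the reported gain.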

Analyzing the data, i.e., determining the effectiveness of an
individual project, involves several steps. Each model prescribes a
particular analysis technique, which can range from simple (Model A) to
complex (Models B and C). Following analysis at the LEA, evaluations are
passed on to the SEA which aggregates the data.

Several authors have discussed what may happen when appropriate
procedures are not followed. Johnson and Thomas (1979) listed problems
that can occur prior to scoring including lost answer sheets, improperly
coded answer sheets, and matching pre- to posttests. Stonehill and
English (1979) specified three types of errors which can occur:
arithmetic (e.g., incorrect computations), procedural (e.g., use of
inappropriate norms tables), and clerical (e.g., incorrect data
transcription). They concluded these errors result in overestimates of
program effect, since positive gains are rarely as well scrutinized as
null or negative effects. Finally, they estimated that more than 95% of
districts not using computer scoring and data processing will make some
errors.

Stonehill and English call for greater reliance on computerized
processing systems. However, Taylor (1981) commented that these also have
drawbacks. Encoding of answer sheets is rarely checked, so that correct
answers are marked incorrectly. Miscellaneous accidental marks on answer
sheets, which do not affect hand scoring, may affect computer scoring.
Finally, when computerized scoring and analysis is performed, teachers are
further removed from the evaluation and less involved in TIERS.

In conclusion, there appears to be no panacea which will eliminate
errors in test administration and data collection. Unfortunately, errors
in these areas can substantially alter the outcome of the evaluation.
Training teachers in the whys and hows of evaluation procedures may
contribute substantially to a reduction in error, but unfortunately
imposes duties upon those who may have the least time available.

Summary 

In summary, the general assumptions for all models have two basic
types of problems, both issues of quality control. Technical problems are
associated with test selection and evaluation model implementation, while
processing errors occur in the scoring and calculation of results. It
appears that the majority of these problems are due to two factors:
(1) the flexibility of the system, and (2) the number of steps required to
complete a given evaluation, i.e., the greater the number of steps, the
greater the number of errors.

The problems of the system are caused in part by the mandate that
TIERS be adaptable to any Title I project. Since the curriculum focus of
a project is decided by the LEA/SEA, the evaluation system must be able to
be used with any curriculum. Despite assistance given to the LEA/SEA by a
Technical Assistance Center (TAC), these problems must in the end be
weighed by the local evaluators. Inappropriate test selection can result
in an invalid evaluation.

Unfortunately, it may be difficult at the LEA/SEA level to determine
if the evaluation is valid. The fit of a curriculum to a test may not be
good, resulting in no measured gain when gain existed; or may be overly
specific, resulting in gains for too narrow an area which would not be
reflected in improvement in the basic skill area. Floor and ceiling




that without treatment, the students would have the same percentile
ranking on the posttest as they did on the pretest. Any percentile change
therefore is attributed to the Title I project.

There are two general difficulties with the equipercentile
assumption. First, growth rates of the publisher's norming population may be
greater than those of the Title I participants. In other words, Title I
students may "lose ground" over time resulting in a lower no-treatment
effect than predicted by the equipercentile assumption. Secondly, the
norming population may not be similar to the Title I population in terms
of such factors as minority composition, age, or SES.

Growth Rates

There are several reasons why maturation rates may change over grade
levels. Kaskowitz and Norwood (1978) examined gains necessary for normal
growth on the MAT and compared them with gains necessary for educationally
significant growth (defined as increase over normal growth by greater than
1/3 of a standard deviation). The rate of normal growth decreases over
grade level, but the standard score standard deviations increase over
grade level. Thus, 1/3 of an SD at grade 1 is a much smaller proportion
of a year's growth than 1/3 of an SD at grade 7. Thus, educationally
significant growth is more difficult to attain at higher grade levels.
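
The arithmetic behind this comparison can be made concrete. The figures
below are invented for illustration and are not MAT values from Kaskowitz
and Norwood; the function name is likewise hypothetical.

```python
def significant_gain_share(normal_yearly_growth, standard_deviation):
    """One third of an SD expressed as a proportion of a year's normal growth."""
    return (standard_deviation / 3) / normal_yearly_growth


# Hypothetical standard-score values: growth shrinks and SD widens with grade.
grade_1_share = significant_gain_share(normal_yearly_growth=30.0,
                                       standard_deviation=9.0)
grade_7_share = significant_gain_share(normal_yearly_growth=10.0,
                                       standard_deviation=15.0)
print(grade_1_share, grade_7_share)
```

Under these assumed figures, a first grader needs about a tenth of a year
beyond normal growth to reach educational significance, while a seventh
grader needs about half a year beyond normal growth.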

Virtually no literature examines this problem coupled with functional
(out of level) testing. For example, a fifth grader tested at third grade
level would be required to perform at a third grade growth level which is
greater than that for fifth graders since gains necessary for normal
growth decrease over grades. However, educationally significant gains are
less for third grade than for fifth grade, so while required growth may be
greater, less is required for educational significance. In recommending
functional level testing for Model A, TIERS must assume that these
increases and decreases balance themselves out. But there is no
literature to support this unstated assumption. In other words, the
assumption is that a child functioning at the third grade level is a third
grader, in spite of age or previous difficulties in academic gain.

Cross-sectional Studies on the Equipercentile Assumption

The composition of a group of Title I students may differ from that
of the test publisher's norming population in a variety of ways. The most
obvious differences which may be related to academic achievement are SES,
age, and race. How important these factors are with regard to TIERS is
still a question, but RMC does caution the evaluator to use norms from a
"comparable" population. However, "comparable" is never defined, and no
guidelines are given for making this determination.

Two obvious problems in establishing comparability arise (a) when
Title I students are included in the norming population, which would
violate the assumption of no-treatment effect, or (b) when the Title I
population is vastly different from the norming population. For example,
Doherty (undated), as reported in Murray et al. (1979), stated that using
norms not based specifically on disadvantaged, low achieving students
gives non-valid measures of growth. To properly evaluate a group of
students totally unlike those in national norms, standardization on the
appropriate population must be performed.

Mayeske and Beaton (1975) attempted to relate background and school
variables to achievement. Using cross-sectional data, they determined the
percentage of various minority students placing above selected percentile
ranks for white students. For blacks, these percentages decreased over
grade level, while for other minority groups the percentages increased
over grade level. While these data suggest that groups may change their
relative position over time in different directions than whites (thus
invalidating the equipercentile assumption in some cases), the data are
cross-sectional and may reflect population changes and differential
drop-out rates rather than actual changes over time. Longitudinal data
collection is usually necessary if conclusions regarding racial/SES
differences are to be drawn.

Van Hove, Coleman, Rabban and Karweit (1970) examined percentile
ranks of achievement test results for 7 cities at grade 6 and either
grade 3 or 4. Linn (1978) converted the global results reported by Van
Hove et al. into NCE scores. With the exception of one city, NCE scores
dropped between lower and upper grades for nearly all minority schools.
The same effect was seen in the nearly all majority schools of 5 out of
7 cities. Excluding the gains seen in 3 cities (difference between lower
and upper grades: .5 to 4.4 NCEs), the decreases for the nearly all
minority schools (-.7 to -7.7 NCEs difference) were greater than the
decreases for the nearly all majority schools (-.5 to -4.8 NCEs
difference). These studies raise doubts about the usage of the constant
NCE as a no-treatment expectation, even though the data are
cross-sectional and the level of test was not constant across grades.
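
For readers unfamiliar with the conversion involved, the relation between
percentile ranks and NCEs can be sketched as below. This is a minimal
illustration of the standard NCE definition (a normalized scale with mean 50
and standard deviation 21.06); it is not Linn's actual procedure, and the
function names are illustrative.

```python
from statistics import NormalDist

# NCEs are normalized scores with mean 50 and SD 21.06, a scale on which
# NCEs of 1, 50 and 99 coincide with percentile ranks of 1, 50 and 99.
NCE_MEAN = 50.0
NCE_SD = 21.06

def percentile_to_nce(percentile):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return NCE_MEAN + NCE_SD * z

def nce_to_percentile(nce):
    """Convert an NCE back to a percentile rank."""
    z = (nce - NCE_MEAN) / NCE_SD
    return 100.0 * NormalDist().cdf(z)

for pct in (10, 25, 50, 75, 90):
    print(pct, round(percentile_to_nce(pct), 1))
```

Because the transformation is nonlinear, equal percentile differences do not
correspond to equal NCE differences away from the middle of the scale.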

Coleman, Campbell, Hobson, McPartland, Mood, Weinfeld, and York
(1966) compared the status of various minorities with the highest scoring
group (white, urban, northeast population) using standardized scores on
the STEP. Computing the number of SD units that each group was below the
highest scoring group for grades 6, 9 and 12, the authors demonstrated
that blacks' and Puerto Ricans' scores tended to fall on verbal ability and
reading relative to the highest group (from .1 to .5 SD units), but rise on
math tests (.1 to .5 SD units). Once again, the cross-sectional nature of
the data makes it difficult to draw firm conclusions. However, all of
these studies suggest a pattern of growth contrary to the equipercentile
assumption.

Longitudinal Studies on the Equipercentile Assumption

Longitudinal studies are perhaps the best tests of the equipercentile
assumption. Since the same group is monitored over time, variance due to
true differences between populations seen in cross-sectional studies is
eliminated. Additionally, the need to ensure identical testing situations
and identical race/SES composition is removed. Studies which have
longitudinally examined the equipercentile assumption have concluded that
it is valid in some cases, but not others (Powell, Schmidt, & Raffeld,
1979; Kaskowitz & Norwood, 1977; Hiscox & Owen, 1978; Armor,
Conry-Oseguera, Cox, King, McDonnell, Pascal, Pauly, & Zellman, 1976).
All of these studies traced populations of students over at least one
grade level to examine the assumption. Most have methodological problems
which dictate caution in drawing conclusions from the data.

Part of the Armor et al. (1976) study traced achievement of over 700
predominantly minority program students over 4 years. The results show a
percentile increase between grades 3 and 4 but steady decreases thereafter
(33rd percentile in grade 4 to 29th percentile in grade 6). The increase
in the first year was attributed by the authors to a change in tests.
However, the results of this study are confounded by several factors.
First, the equipercentile assumption is assumed only in the absence of
special programs, but most of the students were in special programs.
Since this study, overall, was an evaluation of the Preferred Reading
Program, it is possible that the treatment harmed the students or that the
tests used did not properly match the curriculum. Secondly, there is no
indication of whether the same level of test was used for each pre- and
posttest cycle. If it was not, the differences may be attributable to a
change in test levels.

Problems with different test levels and time of testing are sources
of difficulties in a study done by Kaskowitz and Norwood (1977). The
study hypothesized that the equipercentile assumption may not adequately
describe Title I type student performance in the absence of a special
program. While it was demonstrated that such a population showed
percentile decreases over 3 years with respect to the norm group, testing
occurred at differing levels and off publishers' norming dates. The
authors concluded that norms based on the standardization group will be
too high for an educationally disadvantaged population.

Powell et al. (1979) analyzed pre-posttest scores on the MAT in a
one-year spring-spring testing evaluation. Three groups of pre-posttest
cycles were examined longitudinally: (1) end of second grade - end of
third grade, (2) end of fourth grade - end of fifth grade, and (3) end of
second grade - end of fourth grade. Although the assumption of
equipercentile growth held for the reading subtest results, the standard
scores on the math subtest were statistically significantly different at
the .05 level from those expected using the equipercentile assumption.
Whether this can be transformed into similar differences in terms of NCEs
is undetermined. The main difficulty in interpreting these data is,
again, problems with the test level utilized. In all of the cycles
examined, the form and level of the test changed from pre- to posttesting.
As mentioned previously, differences found when using different test
levels may result from differences in content. While the evaluators
appear to have assumed equivalency between test forms/levels, probably
because the MAT was used throughout, equivalency was not established.
Differences in content and better curriculum-test matching could easily
account for the results.

Hiscox and Owen (1978) also attempted to longitudinally analyze data
to answer questions regarding the equipercentile assumption. In this
study, the authors carefully examined students' achievement test scores
and percentile ranking over 4 years for both Title I and comparable
non-Title I students. As pointed out by the authors, the problems
encountered in the study make interpretation difficult. First, attrition
over the 4-year period was substantial. At the high school level, for
example, more than two-thirds of the students were "lost" by the fourth
year. Secondly, out-of-level testing was widespread and the effect of
this on expanded standard scores or NCEs is not known. While several
groups show enough change over a 3-year period to question the tenability
of the equipercentile assumption, the authors were unable to determine whether
real differences in achievement, changes in test levels over the 4 years,
or the lack of complete data caused the percentile change. In other
words, whether the model (i.e., the equipercentile assumption) or the
problems with the data are at fault was undetermined.

Summary

In summarizing the literature on the equipercentile growth
assumption, very few positive statements can be made since methodological
problems abound. Instead of collecting their own data, most studies
simply used data from previous testings. Consequently, test level for
pre- and posttesting is rarely constant. Without the same test being used
for pre- and posttesting, it is difficult to determine whether any changes
are due to the level of test changing, or to true achievement levels.
Additionally, there is no control for equivalency among testing
conditions. Secondly, it is difficult to determine if standardized norms
are applicable for the group being examined: out-of-level testing would
invalidate the norms due to age and maturation effects, racial and SES
composition may be drastically different, and testing may not have
occurred on or near the publisher's norming dates (Noggle, 1977).

Finally, there is seldom any standard for defining a "Title I
eligible student". Comparing studies is difficult in this light,
especially since several authors have suggested that educationally
disadvantaged students may progress at different rates (Faddis & Estes,
1978; Hiscox & Owen, 1978; Mayeske & Beaton, 1975; Van Hove et al., 1970).
Without a standard of performance for Title I eligibility, severely
educationally disadvantaged students may be aggregated with mildly
educationally disadvantaged students. Test norms would not be appropriate
for both populations since rates of growth for the severely educationally
disadvantaged would be slower than for mildly educationally disadvantaged
students.

The appropriate study on the equipercentile assumption for Title I
students has yet to be performed. Such a study would require pre- and
posttesting students selected on the basis of a selection test as
educationally disadvantaged. Test levels should be identical for pre- and
posttests, and no Title I type intervention should occur. If this
population demonstrated no declines or gains in percentile ranking, then
the equipercentile assumption could be said to hold for Title I type
students.




Until such a study is performed, use of Model A must be coupled with
strict controls. Whether or not equipercentile growth occurred in the
past may be indicative of current growth. As suggested by Murray et al.
(1979), the evaluator could examine this in one of three ways. Pre-Title I
records could be checked for students who would have been selected
for Title I, and, if over a similar period of time this sample maintained
its percentile status and the composition of the sample is similar to
the current Title I sample, the equipercentile assumption may obtain. If
such data are not available, the percentile status of the district as a
whole could be traced over time. If it was maintained, it would also lend
support for the equipercentile assumption within this population, although
this support would not be as strong as with the previous method. Finally,
if no historical data were available, a current local group of students
similar to those in Title I could be examined for equipercentile growth.
This is the weakest method for eliciting support for the equipercentile
assumption.

In summary, without local evidence in support of the equipercentile
growth assumption, Model A should not be used. Without such evidence, it
would be difficult to draw valid conclusions regarding the data.

Selection Testing 

For a program to be evaluated using Model A, participants must be
selected on the basis of a test which is not used as the
pretest. If selection is based on a test given prior to the pretest,
regression towards the mean is expected to occur between the selection and
the pretest, and not between pre- and posttesting. However, regression is
directly tied to the amount of correlation between two tests: the lower
the correlation, the greater the regression. Over time, correlation
between pre- and posttest drops, thus increasing the regression that will
occur between these pre- and posttests, even if regression has already
occurred between selection and pretest (Glass, 1978; Burton, 1978). Such
regression will result in an artificially high estimate of gains due to
academic achievement. Campbell and Stanley (1966) caution against use of
this type of procedure due to such "pseudo gains". While Yap (1978)
demonstrated that selection tests could be used as pretests, the study
was limited since it assumed a normal test score distribution. Real data
may not be normally distributed. In summary, the validity of the
assumption that regression occurs between selection and pretest and
no regression occurs between pre- and posttest is unsupported by the
literature.
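
The link between correlation and regression toward the mean can be
illustrated with a minimal sketch. It assumes a simple linear regression
model with equal means and standard deviations on both tests, which is a
simplification and not a procedure taken from the studies cited above; the
function names are illustrative.

```python
def expected_retest_mean(selected_mean, population_mean, correlation):
    """Expected second-test mean for a group selected on the first test.

    Assumes linear regression with equal population means and SDs on both
    tests, so the group's deviation from the mean shrinks by the correlation.
    """
    return population_mean + correlation * (selected_mean - population_mean)

def pseudo_gain(selected_mean, population_mean, correlation):
    """Gain expected from regression alone, with no treatment at all."""
    retest = expected_retest_mean(selected_mean, population_mean, correlation)
    return retest - selected_mean

# Hypothetical low-scoring group: mean NCE of 30 against a population mean of 50.
for r in (0.9, 0.8, 0.6):
    print(r, round(pseudo_gain(30.0, 50.0, r), 1))
```

As the correlation between the two testings drops, the apparent gain
produced by regression alone grows, which is the "pseudo gain" referred to
above.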

Date of Testing

For the standardized norms to be appropriate when used to evaluate
Title I students under Model A, the User's Guide stipulates that the
actual date of testing occur within two weeks of the publisher's date of
testing, or six weeks if interpolated norms are used. Pre- and
posttesting should be equally distant from these published dates, i.e.,
if the pretest is given four days prior to the published date, the
posttest should also be given four days prior to the published date.

As discussed by Bridgeman (1978) and Baker and Williams (1978),
several problems are encountered with interpolation. The foremost is the
manner of interpolation. One method entails plotting of equal standard
score lines on a graph of NCEs versus dates. Another would be to plot
equal NCEs on a graph of standard scores versus dates. These two methods
will yield the same result only if the standard scores are normally
distributed within the group at both norming dates, which would result in a
linear relationship between standard score and NCEs (Baker & Williams,
1978). These authors recommend the former method on the basis that if the
scores are not exactly normal, and given that rounding occurred during
norm development, this method will yield more accurate results: it
requires only one calculation whereas the latter method requires several
calculations, permitting more error to occur.

The second problem with interpolating norms is whether to interpolate
or extrapolate. If testing should occur in October and April, and instead
occurred in September, one could interpolate using the previous year's
April norms and the current year's October norms. However, as previously
discussed, this type of norm will not account for summer growth or loss.
Extrapolation from the current year's October and April norms is
recommended (Baker & Williams, 1978).
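
The difference between interpolating and extrapolating can be sketched with
a simple straight-line calculation. This is an illustration only: it assumes
the norm value changes linearly with calendar time, which is a simplification
of the graphical methods discussed by Baker and Williams, and the dates and
norm values used are hypothetical.

```python
from datetime import date

def norm_at_date(test_date, date_a, value_a, date_b, value_b):
    """Linearly interpolate (or extrapolate) a norm value for test_date.

    date_a and date_b are the two norming dates with known values. A
    test_date between them is interpolated; one outside them is extrapolated
    along the same straight line.
    """
    span_days = (date_b - date_a).days
    fraction = (test_date - date_a).days / span_days
    return value_a + fraction * (value_b - value_a)

# Hypothetical standard-score norms for one percentile at the fall and
# spring norming dates of the same school year.
fall, spring = date(1979, 10, 15), date(1980, 4, 15)
september_norm = norm_at_date(date(1979, 9, 17), fall, 400.0, spring, 420.0)
print(round(september_norm, 1))
```

Extrapolating backward from the current year's two norming dates keeps the
estimate on the within-year growth line, whereas interpolating between the
previous spring and the current fall would span the summer.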

Additional difficulties can occur when students are absent on the
appropriate testing day. While make-up testing is mentioned infrequently
in the literature, the problem of make-up testing and the equating of
testing conditions may create additional problems in interpolation and
interpretation. Finally, most of the literature pertaining to
interpolation discusses the topic from the standpoint that it should only
be done if there is no alternative. Indeed, careful advance planning
can eliminate the need to interpolate norms. Where interpolation cannot
be avoided, however, interpolation methods should be specified by TIERS,
so that results are comparable across projects.






Summary 



There are serious threats to the internal validity of Model A in
terms of possible historical, maturational, selectional, and
instrumentational differences between the norming sample and the Title I
sample. An additional threat comes from the assumption that statistical
regression will occur only between selection and pretest. Finally, the use
of interpolated norms can result in biased estimates of expected gains. The
combination of these factors makes Model A the weakest of the three models.
Unfortunately, it is the most frequently used model as it is the simplest
and least expensive to implement.

C. ASSUMPTIONS OF MODEL B

Two issues are important for determining the validity of Model B.
The first deals with the appropriate control group, and the second deals
with the appropriate statistical adjustment for non-equivalent control
groups (Tallmadge & Horst, 1976). Model B is the strongest of the three
models in that its design, i.e., the use of a control group, is well
supported in the literature (e.g., Campbell & Stanley, 1966). The
problems which occur with this model are not difficulties with the
validity of the assumptions used, but difficulties with their
implementation. Unfortunately, the use of a proper control group is
extremely difficult to achieve due to ethical constraints as well as
problems of definition.






Composition of Control Group

Model B calls for comparing performance of Title I participants to
a no-treatment control group. This control group must be selected using
the same criteria used to select participants, and must receive the same
educational advantages and curricula as Title I participants except the
actual Title I program being evaluated. Implicit in this model are the
factors of identical testing conditions, test levels, and racial/SES group
composition. In short, all factors that could affect the performance of
the Title I participants must be incorporated into the control group.

Unfortunately, this ideal control group is difficult to locate. For
example, ethical considerations preclude random assignment into treatment
and no-treatment groups. SEAs can circumvent this by having Title I and
non-Title I schools, so that the control group can be drawn from the
non-Title I schools. This can create some problems since Title I schools
are usually chosen on the basis of greatest need. In the relatively rare
instances where students participate only half of the school year, the
students in the program during the fall can be compared to those who did
not participate in the fall but in the spring. In such cases, random
assignment can occur in terms of who receives the program when.

Unless the control group is comparable to the treatment group, Model
B cannot be used. If implementation occurs using similar but
non-equivalent control groups, statistical adjustments are necessary.
Unfortunately, the proper use of these adjustments is difficult to
ascertain, either in the materials produced by RMC, or in the evaluation
literature (Goldman & Crane, 1980). The following section will examine
the literature pertaining to such adjustments.

Statistical Adjustments

Basically, two types of adjustment are available to the user of
Model B. These adjustments are analysis of covariance and principal axis
adjustment. Covariance analysis, which uses the slope of the common
within-group posttest on pretest line to adjust for the initial
differences, is most appropriate when used with groups which are "random
in effect" (Tallmadge & Horst, 1976). Principal axis adjustment uses
the ratio of the pooled within-group posttest standard deviation to the
pooled within-group pretest standard deviation. Adjustment is then made
by adding the product of the principal axis and the pretest difference
between the groups to the control group's posttest mean, which gives the
no-treatment expectation for the treatment group.

Essentially, principal axis adjustment is analysis of covariance when
the correlation between covariate and the dependent variable is set equal
to 1.0. In other words, the use of principal axis is appropriate if the
groups exhibit stable differences over time. Kenny (1975) argues that
principal axis adjustment is usually most appropriate since those assigned
to Title I are usually the most needy students, so the control group
represents a population drawn from a radically different environment.
Differences, usually seen in terms of higher pretest means in the control
group, may cause serious threats to the internal validity of the
non-equivalent group design since the difference may be due to dissimilar
maturation rates and these may interact with selection (Campbell &
Stanley, 1966).
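
The two adjustments can be compared in a short sketch. It follows the
definitions given in the text (covariance uses the pooled within-group
regression slope; principal axis uses the ratio of pooled within-group
standard deviations) and is an illustration only, not the RMC computational
procedure. The function names and the scores are hypothetical.

```python
from statistics import mean

def pooled_within_sums(groups):
    """Pooled within-group sums of squares and cross-products.

    groups is a list of (pretest_scores, posttest_scores) pairs.
    """
    sxx = syy = sxy = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxx += sum((x - mx) ** 2 for x in pre)
        syy += sum((y - my) ** 2 for y in post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
    return sxx, syy, sxy

def adjusted_effects(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Return unadjusted, covariance and principal axis effect estimates."""
    sxx, syy, sxy = pooled_within_sums(
        [(treat_pre, treat_post), (ctrl_pre, ctrl_post)]
    )
    covariance_slope = sxy / sxx              # common within-group slope
    principal_axis_slope = (syy / sxx) ** 0.5  # ratio of pooled SDs
    pre_diff = mean(treat_pre) - mean(ctrl_pre)
    post_diff = mean(treat_post) - mean(ctrl_post)
    return {
        "unadjusted": post_diff,
        "covariance": post_diff - covariance_slope * pre_diff,
        "principal_axis": post_diff - principal_axis_slope * pre_diff,
    }

# Hypothetical scores: the control group starts 20 points higher.
effects = adjusted_effects(
    treat_pre=[20, 30, 40], treat_post=[30, 40, 50],
    ctrl_pre=[40, 50, 60], ctrl_post=[45, 55, 65],
)
print(effects)
```

In this constructed example the within-group correlation is 1.0, so the two
slopes coincide and both adjustments give the same estimate. With a lower
correlation the covariance slope is smaller than the principal axis slope
and the two estimates diverge.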

The use of the principal axis adjustment is dependent upon stable
differences between the groups being measured. This assumption is based
on the fan-spread hypothesis, which states that the difference between
group means is constant over time relative to the pooled standard
deviation within groups (Kenny, 1975). A major difference between this
and the equipercentile assumption is that with the fan-spread hypothesis,
performance of both groups is being measured, while with the equipercentile
assumption, the performance of a population which may not be included in
the norming sample is being measured against that norming sample. If the
population being tested was in fact a subsample of the norming population,
the equipercentile assumption and the fan-spread hypothesis would be
synonymous.
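
A rough check of the fan-spread hypothesis as stated above is to compare the
gap between group means, expressed in pooled within-group standard deviation
units, at pretest and at posttest. The sketch below is an illustration under
that reading of the hypothesis; the helper names and scores are hypothetical.

```python
from statistics import mean

def pooled_sd(group_a, group_b):
    """Pooled within-group standard deviation of two groups."""
    ss = 0.0
    for group in (group_a, group_b):
        m = mean(group)
        ss += sum((x - m) ** 2 for x in group)
    return (ss / (len(group_a) + len(group_b) - 2)) ** 0.5

def standardized_gap(group_a, group_b):
    """Difference between group means in pooled SD units."""
    return (mean(group_a) - mean(group_b)) / pooled_sd(group_a, group_b)

# Hypothetical scores in which both the gap and the spread double over time.
pre_gap = standardized_gap([20, 30, 40], [40, 50, 60])
post_gap = standardized_gap([40, 60, 80], [80, 100, 120])
print(pre_gap, post_gap)
```

If the two standardized gaps are about equal, the data are consistent with
fan-spread; a clear difference between them suggests the hypothesis is
violated for these groups.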

Several authors have maintained that the fan-spread hypothesis has
serious weaknesses (Linn, 1978; Linn & Werts, 1977; Goldman & Crane,
1980). Linn and Werts (1977) demonstrated that if the fan-spread
hypothesis is untrue, standardized-gain-score methods can lead to biased
estimates of gain. Goldman and Crane (1980) used computer simulated data
to examine the bias which results when different analytical techniques are
used in different situations. With regard to the principal axis
adjustment, the four conditions important to this review were:

1. Random assignment, equal principal axes.

2. Nonrandom assignment, equal principal axes, pretest means and SDs
unequal.

3. Nonrandom assignment, unequal principal axes.

4. Nonrandom assignment, unequal principal axes, equal pretest
means.

Three scores were obtained for each condition: unadjusted, covariance,
and principal axis. Unequal principal axes violate the fan-spread
hypothesis, so in that case a common within-group principal axis should
not be calculated.




As predicted, any adjustment (either covariance or principal axis) is
better than no adjustment at all. For the ideal situation (condition 1)
all three methods demonstrate a small and roughly comparable amount of bias
(-.26 NCEs unadjusted, -.16 NCEs covariance, and -.13 NCEs principal axis).
For condition 2, the ideal situation for use of principal axis, principal
axis produced the least biased estimate, differing by only .07 standard
score units from the actual gain. In condition 3, where unequal principal
axes violate the fan-spread hypothesis, principal axis produced less bias
than covariance (-4.19 NCEs vs -5.80 NCEs, principal axis vs covariance
respectively), but substantial bias was present. In condition 4, where
again the fan-spread hypothesis is violated, covariance and principal axis
produced about the same amount of bias (-4.44 NCEs and -3.97 NCEs
respectively).

In summary, bias was least where principal axes of the groups were
equal. The primary implication is that the principal axis adjustment is
best in situations where the fan-spread hypothesis is operating. In any
case, the principal axis method produces the least bias, although as
conditions degenerate, greater bias appears. Finally, all biases were in
negative directions, which would lower treatment effect estimates. The
authors recommended that despite difficulties in discerning whether the
fan-spread hypothesis is in effect, the principal axis method is best in
all non-equivalent group situations, since it produces the least bias and
any treatment effect seen will reflect the minimum gains made, since the
bias is in the negative direction.







Summary 






It appears as though the problems of Model B are somewhat more
surmountable than the problems associated with Model A. The control group
design has a strong background of support in the evaluation community.
However, it is crucial that the data be examined for equivalency between
control and treatment groups. If non-equivalencies exist, differences
must be adjusted for using the appropriate method. Although not well
discussed in the literature, large attrition can affect either statistical
adjustment. Although use of the principal axis adjustment assumes that
the fan-spread hypothesis is in effect, violation of the fan-spread
hypothesis does not appear to have such drastic effects that evaluations
would be invalid. However, in studies where possible gains are unknown or
where they may be small, the evaluator is cautioned against the use of
principal axis if there is any question of violations of the fan-spread
hypothesis (e.g., unequal principal axes), since its use may lower
treatment effect estimates, thereby overshadowing real gains.



D. ASSUMPTIONS OF MODEL C 


Model C is specifically designed not to have an equivalent control
group. A comparison group is formed by including all students scoring
above a specified score on the pretest, while all those who scored below
this score are included in the treatment group. Posttest on pretest
regression lines are fitted to both groups. If there was no treatment
effect, the Title I regression line would be a downward extension of the
comparison group's regression line. If the treatment was effective, then
the regression line for the Title I group would lie above and parallel to
the regression line for the non-Title I group. Two assumptions must be
met for this model to function properly. The first is firm adherence to
the cutoff score. The second is that both regression lines are
parallel and linear.
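
The logic of Model C can be sketched as follows: a line is fitted to the
comparison group only, extended downward to the treatment group's pretest
mean, and the treatment group's observed posttest mean is compared with that
expectation. This is a minimal illustration assuming linear regression and a
strict cutoff, not the TIERS computational procedure. The function names and
data are hypothetical, and placing students who score exactly at the cutoff
in the comparison group is an assumption.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def model_c_effect(pre, post, cutoff):
    """Estimate the treatment effect at the treatment group's pretest mean."""
    comparison = [(x, y) for x, y in zip(pre, post) if x >= cutoff]
    treatment = [(x, y) for x, y in zip(pre, post) if x < cutoff]
    slope, intercept = fit_line(
        [x for x, _ in comparison], [y for _, y in comparison]
    )
    treat_pre_mean = mean([x for x, _ in treatment])
    treat_post_mean = mean([y for _, y in treatment])
    expected = intercept + slope * treat_pre_mean  # no-treatment expectation
    return treat_post_mean - expected

# Hypothetical data: everyone follows post = 0.8 * pre + 12, and students
# below the cutoff of 40 gain an extra 4 points from the program.
pre = [20, 25, 30, 35, 40, 45, 50, 55, 60]
post = [0.8 * x + 12 + (4 if x < 40 else 0) for x in pre]
print(round(model_c_effect(pre, post, cutoff=40), 2))
```

The estimate depends on the comparison group's line remaining valid below
the cutoff, which is why linearity and strict adherence to the cutoff
matter.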

Adherence to Cutoff Scores

The use of a distinct cutoff score is required in Model C to
distinguish between the control and treatment groups. Unfortunately,
failure to maintain such a cutoff is probably the most common problem
encountered during Model C implementation. Despite the requirement of a
strict cutoff score to separate Title I program participants from the
comparison group, many programs do not adhere to the score (Yap, Estes, &
Hansen, 1979). There is often a gray area such that the single cutoff
score becomes a band of cutoff scores.

Yap et al. examined the various evaluation outcomes which could occur
when a band of cutoff scores is used. These conditions included: (1)
adherence to the strict cutoff as outlined by Tallmadge and Wood (1976);
(2) use of a band of cutoff scores, but students in the band excluded from
analysis; (3) use of a band of cutoff scores with those in the band
randomly assigned to treatment or comparison group and included in the
analysis; and (4) use of a band of cutoff scores with students in the band
assigned to treatment or comparison group on the basis of teacher ratings
and included in analysis.

Using computer simulated data, they found that when a strict cutoff
was used, relatively unbiased estimates of effects resulted. In the
second condition, where students in the gray area were excluded from
analysis, only slight differences between estimated and actual gains
appeared (all less than 2.0 NCEs). When students in the gray area were
randomly assigned to groups, practically no bias was introduced. However,
in the final case where teacher ratings were incorporated into placement,
several sources of bias resulted. For almost half of the instances, the
difference between estimated and actual gains was greater than 1.0 NCE and
reached 3.36 NCE in one case. Bias tended to increase as the width of the
gray area increased and as the number of students in this area increased.
In general, bias for all conditions, when it occurred, tended to favor the
treatment group in that the estimated gains were higher than the actual
gains. However, the use of teacher ratings tended to suppress actual
treatment effects, yielding an estimated gain which was less than actual
gain.

One difficulty with the Yap et al. study is that the data were
simulated on the basis of equal growth rates across all students. As was
seen in the equipercentile growth assumption of Model A, lower functioning
students do not necessarily have the same growth rates as higher
functioning students. However, these results do suggest guidelines to be
followed if additional variables are used in selecting students for
Title I programs. If students who do not meet a strict cutoff on the
selection measure are permitted to enroll in Title I programs, their
scores should not be included in the analysis when the program is
evaluated.



Parallel and Linear Regression Lines 






Model C is based upon the assumption that regression lines for
treatment and control groups will be parallel and linear. The User's
Guide states that a test which produces curvilinear regression should not
be used. However, it may be difficult to determine beforehand whether
curvilinearity exists with a particular group. As mentioned briefly in
the section on out-of-level testing, floor and ceiling effects tend to
result in curvilinear regression lines (Estes & Anderson, 1978).
Additionally, parallel regression lines are based on the assumption of
equal maturation rates between the two groups. Unlike in Model A, this
assumption is not usually violated since all students (comparison and
treatment) are drawn from the same local population, whereas in Model A, a
subset of the local population is compared to a national standard.

Unfortunately, even if the evaluator adheres strictly to the User's
Guide, there are many instances where non-linearities could appear. The
few articles which discuss nonlinearity are attempts to make adjustments
so that nonlinear data can still be used.

Estes and Anderson (1978) analyzed the pre-post test scores of 730
ninth graders on three math tests (Comprehensive Tests of Basic Skills
Math Subtest (CTBS), Shaw-Hiehle Individualized Computational Skills Test,
and the Minimal Mathematics Proficiency Test (MMPT)) to test the
no-treatment expectation for Model C. Hypothetical treatment and control
groups were formed and analyzed as dictated by Model C. In spite of no
treatment, estimates of treatment impact in NCEs at pretest mean and
cutoff score were 4.4 and 1.3 for the CTBS, -6.30 and -4.42 for the MMPT
and 2.15 and 2.90 for the Shaw-Hiehle. Statistical tests comparing
"treatment" and "control" groups' results were statistically significant
at the .05 level in almost all cases.

Floor effects were detected on the CTBS pretest and ceiling effects
on the MMPT posttest. These effects were demonstrated by unequal
treatment effect estimates at the pretest mean and cutoff score. The
authors recommend that if floor or ceiling effects are present, then these
students should be deleted from the evaluation. Unfortunately, deleting
such students may alter the composition of the population in terms of
growth since specifically deleting higher level students or lower level
students is not random deletion.

Echternacht and Swinton (1979) propose four possible solutions to the
curvilinearity problem. These include Mosteller and Tukey's re-expression
of posttest scores, differential weighting of scores in different parts of
the pretest, use of quadratic regression lines and extrapolation to obtain
the no-treatment expectation, and the fitting of parallel lines to the
data for each group. Each method has drawbacks.

Mosteller and Tukey's re-expression of the posttest scores is a
rough method and is described as "as much of an art as a science". If few
data points exist, results can vary depending on which points are selected
for re-expression, which may result in unreasonable expectations.

If a computer is available to the evaluator, the technique of
weighting scores may be applied. However, Echternacht and Swinton
concluded that such weighting produces essentially the same regression
function as one would achieve using quadratic fits. If the data are
nearly linear such fits may work well, but when there are few data points,
or if an unusual function exists, fitting or extrapolating from the fit
can be dangerous.

Finally, Echternacht and Swinton discussed the use of parallel lines
fit, i.e., analysis of covariance. The traditional Model C approach
differs from the parallel lines fit in that the traditional approach
regresses the data from the control group and extrapolates to the
treatment group, whereas parallel slope fits all of the data on the
assumption of parallel slopes. When ceiling effects are present, parallel
slope fit is better than the Model C approach. When floor effects are
present, the Model C approach is better. If a treatment x pretest
interaction occurs, Model C may be better, since the parallel lines fit
will confound such an interaction with the estimate of treatment impact.

However, if no such interaction is present, parallel lines fit is better
provided the control group is not greatly larger than the treatment group
(Cochran, 1969). Echternacht and Swinton concluded that while Model C
works well in the absence of floor and ceiling effects, when such
non-linearities are present the data should be fitted a variety of ways
and results compared. If fits with the parallel lines procedure and Model
C procedures provide similar results, the curvilinearity is probably not
serious.
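
The parallel lines fit can be sketched as a standard covariance
adjustment. This is a minimal illustration, not the procedure given by
Echternacht and Swinton: the function names and the example scores in the
usage note are assumptions. The common slope is estimated from the pooled
within-group sums of squares and products, and the posttest difference
between groups is then adjusted for the pretest difference.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Common slope from pooled within-group sums of squares and products.

    `groups` is a list of (pretest_scores, posttest_scores) pairs.
    """
    sxx = 0.0
    sxy = 0.0
    for xs, ys in groups:
        x_bar, y_bar = mean(xs), mean(ys)
        sxx += sum((x - x_bar) ** 2 for x in xs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def parallel_lines_impact(t_pre, t_post, c_pre, c_post):
    """Posttest mean difference adjusted for the pretest mean difference."""
    slope = pooled_within_slope([(t_pre, t_post), (c_pre, c_post)])
    return (mean(t_post) - mean(c_post)) - slope * (mean(t_pre) - mean(c_pre))
```

For example, if every student's posttest equals 10 + 0.8 x pretest and the
treatment group additionally gains 4 points, the adjusted difference
recovers the 4-point effect. Comparing this estimate with an
extrapolated control-group regression on the same data is the kind of
cross-check the authors recommend.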

Summary

In terms of rigor, Model C falls between Models A and B. As in
Model B, the problems encountered appear to be surmountable. However, for
Model C to produce valid results, evaluators should firmly adhere to the
cutoff score. Special attention should be given to floor-ceiling test
effects which create non-linear regression lines. When non-linearity is
unavoidable, the data should be fitted according to procedures outlined in
Echternacht and Swinton (1979).



THE COMPARABILITY OF MODELS 

According to the User's Guide, the use of any of the models will
yield comparable results. This aspect is extremely important since
results are intended for aggregation at SEA and national levels. If the
models are not comparable, this aggregation will produce inappropriate,
faulty summaries regarding the impact of Title I. Several studies have
addressed the issue of comparability. While a portion of this research
has been discussed in the previous section with regard to the use of
normed versus non-normed tests within a given model, the following
section will detail various field studies and computer simulations that
have directly compared the results obtained from different models.

To properly compare different models, the assumptions applicable to
all models plus the specific assumptions for each model must be
followed. For example, to compare Model A to Model B or C, all students
being compared must: (1) be selected on the basis of a selection test which
does not serve as the pretest, and (2) be tested within six weeks
of publisher's norm dates, in addition to meeting the requirements for
Model B (control groups only randomly different and appropriate analysis
used to adjust pretest scores) and/or Model C (control group is all
students scoring above a specified criterion on the pretest). In
addition, assumptions concerning appropriate test selection and
administration must be met. In almost every study purporting to compare
two models, major assumptions are violated.

One of the major difficulties in making these types of comparisons
is that the data cannot usually be examined post hoc. Since the
decision to use a particular model is usually made beforehand, a
different model's requirements are frequently not met. Table 1 lists
those studies which report comparisons and which, if any, assumptions
were violated.







Table 1

Summary of TIERS Model Comparison Studies

[Bracketed entries mark text that is illegible or uncertain in the source
copy.]

STUDY AND YEAR: Crane & Cech, 1979
MODELS COMPARED: [illegible]
ESTIMATES OF TREATMENT ACROSS MODELS:
    A = B (Kindergarten)
    B > A (1st & 2nd grade)
COMMENTS:
    1) Selection for participation [illegible]
    2) Selection on basis of either "Review of previous testing" or
       "recommendation of Child Study Team"
    3) No selection test (Model A required). Interpolated norms
    4) Kindergarten, 1st & 2nd grades examined; RMC only intended for
       2nd grade and above
CRITICAL REVIEW:
    Procedures not specific enough to tell exactly what happened;
    however, there are sufficient violations of the models to place
    any conclusions in doubt

STUDY AND YEAR: Faddis, Arter, & Zwertchek, 1979
MODELS COMPARED: B vs A ([illegible] norm)
ESTIMATES OF TREATMENT ACROSS MODELS:
    For READING, Model B [showed] less impact than Model A with
    national or local norms ([values illegible])
    For LANGUAGE, A national / B < A local ([values illegible])
COMMENTS:
    1) Interpolated norms, testing off publisher's dates (Model A
       required); control groups not randomly different (Model B
       required)
    2) Attrition problem different and greater in Title I schools
       versus control schools
    3) Publisher's norm group included Title I students; therefore,
       greater growth at low level achievement included
    4) Control group [examined] to check equipercentile growth for
       Model A showed percentile [losses] similar to Title I students,
       i.e., percentile not maintained
CRITICAL REVIEW:
    Too many violations of Model A to permit [any] conclusion (i.e.,
    interpolated norms due to testing off date; equipercentile not
    maintained in control group)

STUDY AND YEAR: Gabriel, Stennar, & Troy, 1977
MODELS COMPARED: [illegible]
ESTIMATES OF TREATMENT ACROSS MODELS:
    1st Grade:
        A = B   Reading   3.4 vs [5.5?]
        A = B   Math      4.2 vs [?].7
    2nd Grade:
        A > B   Reading   [9.7?] vs -2.8
                Math      8.1 vs [.09?]
    3rd Grade:
        A = B   Reading   3.1 vs 2.3
        A > B   Math      7.9 vs 3.0
COMMENTS:
    1) Interpolated norms in grades 2 & 3 may have caused differences
       between Models A and B. Interpolated norms not used in 1st grade
    2) Authors speculate that removal of low achievers helps rest of
       class
    3) Different achievement tests were used for each model
       ([test names illegible])
    4) Correlation not specified
CRITICAL REVIEW:
    Use of interpolated norms may have caused differences seen in
    2nd & 3rd grades. These norms not used in 1st grade and no
    differences seen

STUDY AND YEAR: Hardy, 1979 (as reported by Murray et al., 1979 and
    Echternacht, 1980)
MODELS COMPARED: [illegible]
ESTIMATES OF TREATMENT ACROSS MODELS:
    A > C (size of difference [illegible])
COMMENTS:
    1) Most Model A tested fall, spring while Model C tested spring,
       spring
    2) Aggregated across schools and grades but similar school sizes
       and close pretest averages
CRITICAL REVIEW:
    [Illegible] could have easily caused lower Model C estimates of
    treatment impact

STUDY AND YEAR: [illegible; the text following the table names
    House, 1979 as the remaining study]
MODELS COMPARED: [illegible]
ESTIMATES OF TREATMENT ACROSS MODELS:
    Reading:
        Grade 4: C slightly higher estimate than A
            (A = -.7 NCE, C = 1.5 NCE)
        Grades [6?], 8: A slightly higher estimate than C
            (A = 2.3 (grade [6?]), 4.1 (grade 8);
             C = 1.4 (grade [6?]), 2.5 (grade 8))
    Math:
        Grade 4: C > A   (A = [.1?], C = 1.3)
        Grade [6?]: C = A   (3.5, 3.4)
        Grade 8: A > C   ([values illegible])
COMMENTS:
    1) Difficult to determine if both models used [illegible]
    2) Size of treatment and control groups greatly different
       ([group sizes illegible])
    3) [illegible]
CRITICAL REVIEW:
    Since Model C regression lines appear to be unadjusted for
    pretest scores and slopes were not parallel, it appears as
    though it was not implemented correctly. It is difficult to
    determine exactly what the procedure [illegible] different for
    [illegible] C regression lines

STUDY AND YEAR: Tallmadge & Wood, 1980
MODELS COMPARED: A vs B vs C, 3 different analyses:
    (1) large [illegible]
    (2) smaller [illegible] (i.e., [illegible] achievers)
    (3) small sample (simulated project)
ESTIMATES OF TREATMENT ACROSS MODELS:
    Model A tended to produce positive bias in estimates of
    approximately 1 NCE. Model C tended to be less accurate than
    A & B, and most sensitive to disturbances with its assumptions
    (i.e., floor-ceiling effects), but, in general, models equivalent
COMMENTS:
    1) Extreme care was taken not to violate the basic assumptions
       of any model
CRITICAL REVIEW:
    On the whole, this study has appropriate data to make its
    conclusions regarding overall model equivalency when careful
    control is [implemented]



In examining Table 1, only one of the six studies meets the
requirements for each model being compared: Tallmadge and Wood (1979).
All of the remaining five studies (Crane & Cech, 1979; Faddis, Arter, &
Zwertchek, 1979; Gabriel, Stennar, & Troy, 1977; Hardy, 1979; House,
1979) contain methodological flaws which could account for differences
obtained. In two of these studies (Crane & Cech, 1979; House, 1979),
methodology and procedure sections are not specific enough to make
determinations about study validity.

Tallmadge and Wood (1979) conclude that the models are comparable,
while the five remaining studies, albeit flawed, do not come to this
conclusion. Obviously, more properly controlled studies need to be
performed in this area. The current collection of literature, however,
is informative from the standpoint of how models are actually
implemented at the local level versus how RMC envisions their
implementation (Tallmadge & Wood, 1979). If the models are not strictly
implemented, it appears as though the "garbage in, garbage out" effect
holds. This problem, in and of itself, may cast suspicion on the
results of TIERS, not because of any problem within the system, but due
to external factors.

SUMMARY AND CONCLUSIONS

The development of TIERS represents the most comprehensive
attempt on the part of the federal government to regularly and
objectively evaluate a federally-funded program of this magnitude. In
examining the validity of TIERS, the fact that it is a relatively new
endeavor must be kept in mind. This review has examined the literature
which assesses TIERS to determine if the system is and/or could be valid
as an evaluation system.






Three types of concerns are discussed in the literature. These
include: whether the statistical assumptions of the models are valid,
whether proper implementation of a given model can occur at the local
level, and whether the models are comparable. This section will examine
these concerns and suggest possible solutions beyond the recommendations
made in the review.

Validity of Statistical Assumptions 

The first assumption of Model A is the equipercentile growth
assumption. It is assumed that without treatment, students will
maintain their pretest score percentile ranking on the posttest. There
are serious threats to the internal validity of Model A. These are due
to possible differences between the norming sample and the Title I
sample in the areas of maturation, selection, and instrumentation. The
second assumption of Model A is the regression effect. It is assumed
that statistical regression will occur only between selection and
pretest and not between pre- and posttests. The literature does not
support this assumption.
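
The equipercentile assumption can be made concrete with a short sketch.
Under it, the no-treatment expectation for the posttest is simply the
pretest percentile rank, so a Model A gain is the difference between the
posttest and pretest standings expressed in NCEs. The sketch below uses
the standard definition of the NCE scale (a normal-curve transformation
with mean 50 and standard deviation 21.06); the function names and the
example percentiles are assumptions made for illustration, not taken from
the User's Guide.

```python
from statistics import NormalDist

# Standard deviation of the NCE scale, chosen so that NCEs of 1, 50,
# and 99 coincide with percentile ranks of 1, 50, and 99.
NCE_SD = 21.06


def percentile_to_nce(percentile):
    """Convert a national percentile rank (0 < percentile < 100) to an NCE."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return 50.0 + NCE_SD * z


def model_a_gain(pre_percentile, post_percentile):
    """NCE gain over the equipercentile (no-treatment) expectation.

    Assumption: both percentile ranks come from the publisher's norms
    for the respective testing dates.
    """
    return percentile_to_nce(post_percentile) - percentile_to_nce(pre_percentile)
```

A group that stays at the 20th percentile shows a gain of zero; a group
that moves from the 20th to the 25th percentile shows a gain of roughly
3.5 NCEs.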

The fan-spread hypothesis is the primary assumption of Model B.
This assumes that the difference between group means is constant over
time relative to the pooled standard deviation within groups. This
assumption is only in effect when the principal axis method of adjusting
pretest scores is utilized, i.e., when the differences between treatment
and control groups are presumed to be fixed. Demonstrations using
simulated data have shown that violation of the fan-spread hypothesis
will not drastically affect the estimate of treatment gains.
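
A minimal sketch of what the fan-spread hypothesis implies is given
below. It is not the principal axis adjustment itself, only a simplified
illustration of the underlying idea: the pretest gap between groups,
expressed in pooled within-group standard deviations, is carried forward
to the posttest to form the no-treatment expectation. The function names
and example data are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled within-group standard deviation of two samples."""
    num = (len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)
    return sqrt(num / (len(a) + len(b) - 2))


def fan_spread_impact(t_pre, t_post, c_pre, c_post):
    """Observed treatment posttest mean minus the fan-spread expectation."""
    # Pretest gap between groups in pooled standard deviation units.
    gap_in_sd = (mean(t_pre) - mean(c_pre)) / pooled_sd(t_pre, c_pre)
    # Under fan-spread, the same standardized gap holds at posttest.
    expected_t_post = mean(c_post) + gap_in_sd * pooled_sd(t_post, c_post)
    return mean(t_post) - expected_t_post
```

If the treatment group starts one pooled standard deviation below the
control group and is still one pooled standard deviation below at
posttest, the estimated impact is zero, even though the raw-score gap has
widened along with the spread of scores.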






Model C assumes parallel and linear regression lines. Parallel
lines follow from an assumption of equal growth rates in comparison and
treatment groups. Curvilinear lines will result if floor or ceiling
test effects occur. Curvilinear regression lines can be fitted
according to various procedures outlined in Echternacht and Swinton
(1979).

Models B and C have the strongest support in the literature, while
Model A appears to have substantial difficulties. With regard to all
models, the major difficulty apparent in TIERS is that no provisions are
made for what to do if an assumption has been violated. It is probable
that the typical evaluator will proceed regardless of whether an
assumption has been knowingly violated. There are no provisions for
salvaging violated data and still obtaining meaningful information.

Proper Implementation 

Proper implementation of any model requires a number of decisions
to be made. These include: what test to use, when to administer it,
how to analyze the data, which model to use, and how to select
participants and controls. Errors made in proper implementation fall
into two categories: technical errors and processing errors.

Technical errors tend to occur when user guidelines are not
properly implemented. They can occur in several areas including test
selection, date of testing, data analysis, choice of a model, and
selection of participants and controls. These errors may be very hard
to detect since the whys of specific decisions in these areas may not be
known.



Processing errors occur during scoring and calculation of the
results. These errors are difficult to detect if they result in
positive treatment gains, since such results are expected and do not
lead to double checking of the data. Negative results due to processing
errors tend to be corrected.

TIERS has no way of forcing proper implementation. Despite the
step-by-step nature of the User's Guide, many areas, such as matching
the test to the curriculum, are not fully covered. Technical papers
prepared in conjunction with the User's Guide, while they can be
helpful, may also be overlooked.

Model Comparability

Unfortunately, most of the literature in this area violates various
model assumptions, thus rendering the comparison useless. The excellent
study by Tallmadge and Wood (1979) does conclude that the models are
comparable. However, the remaining studies, albeit flawed, do not.

Unfortunately, it is these latter studies that are representative
of TIERS implementation at the local level. These studies violate a
wide range of guidelines established by TIERS or by other evaluation
literature. If these types of violation are occurring at the local
level, then it is difficult to evaluate the results of a project or to
gather comparable data.

Possible Solutions 

Due to the lack of support for its assumptions regarding
equipercentile growth and regression, Model A appears to be inappropriate
for Title I evaluation. Models B and C are more difficult to implement
than Model A and are more costly, since at least twice as much testing is
required due to the additional control/comparison groups. However,
Models B and C, if properly implemented, appear to evaluate Title I
projects more effectively. However, if the existing literature is any
indication of what happens at the local level, proper implementation of
any of the models is not occurring. Hence, conclusions based on TIERS
must be viewed with caution. Particularly for Models B and C, the
suggestions made throughout this review regarding proper implementation,
if used, would certainly result in better evaluations.

One solution in improving TIERS and its implementation is the use
of well-trained evaluators at all levels of Title I program evaluation.
Currently, TACs oversee the evaluation system, give workshops in its
usage, and consult with LEAs and SEAs on implementation of TIERS.
However, much of the existing literature demonstrates that these efforts
have not been extensive enough, as the literature reporting on TIERS
contains a wide variety of violations of the system. It seems as though
those who are responsible for the implementation of TIERS at the school
level are faced with many of the decisions required by the system but
lack the expertise to make these decisions.

Murray et al. (1979) have suggested the area of evaluation is not
at a point where valid results and conclusions can be made by those who
are inexperienced. It is probably a small minority of those
implementing TIERS at the local level who understand the consequences of
violating the assumptions of the models or could recognize such
violations. Title I and Title I evaluation are very costly, and it
seems inappropriate that conclusions regarding Title I are drawn by
inexperienced personnel.

Unfortunately, further training of school personnel or adding
personnel specifically trained in evaluation would be costly. Perhaps it
is time to examine what is really needed in terms of evaluating
Title I.






The purpose of Title I evaluation is twofold. First, there is
accountability for monies spent on Title I so that such expenditure can
be justified. Secondly, there is information provided to the LEA and
SEA regarding whether their projects increase the basic skills of the
students selected for participation. Currently, it is suggested that
TIERS be used to evaluate every Title I program every year.

Title I involves thousands of programs and millions of students.
Many of these programs do not change year to year. The same curriculum
is used over again provided some indication exists that it is effective.
Changing the curriculum year to year would be very expensive. The
question arises as to whether it is necessary to evaluate the same
curriculum every year, considering the composition of the local
population is often constant, and Title I participants are similar year
to year. In terms of the purpose of Title I evaluation, solutions not
requiring yearly evaluation should be examined.

Three possible solutions exist which do not call for yearly
evaluation. The most obvious is to stagger evaluation so that all
programs are evaluated every other year, or every third or fifth year.
A second solution is to evaluate only new curricula. Once a new
curriculum is evaluated in a given school and shown to be effective, it
could continue indefinitely without reevaluation at that site. A third
is to randomly sample Title I programs for periodic evaluation (every 1,
2, 3, or 5 years). This evaluation could be conducted by highly
trained evaluators with the technical expertise necessary to implement
TIERS correctly. This study would provide the information necessary to
justify the program at the national level. LEAs would then be free to
use whatever evaluation procedures they want to make local programmatic
and curriculum decisions. Of course some combination of these three
options is also necessary.

The basic idea behind all of the solutions is that when evaluation
occurs, it should be correctly implemented. Evaluation year after year
of the same program is neither cost effective nor will it yield useful
information unless proper implementation occurs. Non-annual evaluation
would cost less than training current local evaluators in proper model
implementation or hiring additional personnel trained in evaluation.
Non-annual evaluation would be done at the school level by non-school
personnel experienced in evaluation. Teachers and/or administrators who
are currently carrying a full work load would not be given the
additional burden of Title I evaluation.

In conclusion, Models B and C of TIERS appear to yield useful
information regarding the effectiveness of Title I. However, if the
models are not properly implemented (which appears to happen
frequently), the results are not useful and may lead to incorrect
conclusions regarding the effectiveness of a given project. A solution
to this is to use personnel trained in evaluation at the school level to
ensure proper implementation. Costs of such a solution may be
prohibitive. For this reason, non-annual evaluations are recommended in
addition to the use of evaluators at the local level. Implementation of
this recommendation would be more cost effective and would result in
better evaluations.






A COMPARISON OF MODELS A AND B 



Procedures 



For purposes of comparing the estimated impact of the same Title I
program using Models A and B, the Title I program in the Salt Lake City School
District in Salt Lake City, Utah was considered. The district's Title I
program was being operated in eight schools. Seven additional schools in the
district had been included in Title I previously or were being considered for
potential expansion of the Title I program.

Data regarding each of the eight Title I schools and the seven potential
comparison schools on poverty level, mobility, daily attendance, average IQ of
student body, percent minority, and reading and math scores from the previous
spring were collected. This information is presented in Table 2. Based on
an analysis of these data, two schools (Wasatch and Nibley) were dropped as
potential comparison schools, leaving five schools to be used in the actual
comparison of Models A and B. There were no statistically significant
differences between any of the Title I or comparison schools on the spring
achievement test data reported in Table 2.

Guidelines for implementing both models suggested by Tallmadge and Wood
(1976) were followed. The guidelines for implementing Model A recommend that
tests be administered within two weeks of the empirically established norm
date, but allow up to six weeks if scores are extrapolated. In this case, the
pretest was administered five weeks before the empirically established norming
date. The posttest was administered two weeks before the empirically
established norming date. In analyzing the data, adjustments were made by






Table 2

School Means on Various Factors Used
in Selecting Comparison Schools

[Bracketed entries mark text that is illegible or uncertain in the source
copy. The source shows two columns headed "Poverty"; the second appears
to carry the label "79".]

                                Poverty                           Percent
     School          Poverty     (79)   Mobility  Attendance  IQ     Minority

 1.  Franklin          34.9      38.7     42.6      97.69     98.4     43
 2.  Jackson           58.4      47.5     58.7      97.70     94.4     52
 3.  Lincoln           65.8      66.5     60.6      95.75     95.4     43
 4.  Lowell            68.5      38.1     62.0      98.18    104.7     15
 5.  L. [Bennion?]     45.6      61.2     65.2      95.98     94.7     31
 6.  Parkview          29.3      26.2     43.6      98.54    107.6     38
 7.  Washington        52.7      49.0     72.3      98.35     94.9     34
 8.  Whitman           29.0      33.3     44.7      98.93     94.7     25
 9.  Backman           28.9      22.7     46.5      97.13     97.1     18
10.  Edison            26.3       -       48.3      98.20     98.6     30
11.  Emerson           27.2      30.3     35.1      97.24    109.6     15
12.  Hawthorn          24.9      18.8     29.3      97.88    105.8     16
13.  Nibley            23.3      19.8     36.5      97.28    117.6     13
14.  Riley             24.5      19.9     42.5      96.92    102.9     21
15.  Wasatch          [1?]9.2    18.8     27.4      98.00    110.8      9

Achievement Test Raw Scores

[The source groups these six columns under the headings "Reading" and
"Math"; the column-by-column assignment is not legible.]

     School            (1)      (2)      (3)      (4)      (5)      (6)

 1.  Franklin        113.18   116.62    78.64    38.70    58.86    55.20
 2.  Jackson         113.80   118.55    73.36    43.65    67.15    44.85
 3.  Lincoln         101.49   114.28    76.67    40.16    64.83    47.74
 4.  Lowell          128.33  134.[?]0   91.10    47.42    71.36    59.55
 5.  L. [Bennion?]   118.84   118.63    86.97    43.80    66.14    55.02
 6.  Parkview         96.26   114.34    78.59    41.38    64.42    49.39
 7.  Washington      115.54   126.59    82.57    47.15    71.72    54.13
 8.  Whitman         107.51   111.93    81.52    43.00    65.36    50.81
 9.  Backman         121.94   117.54    85.52    40.47    68.55    55.96
10.  Edison          105.30   111.59    72.80    41.19    61.64    51.58
11.  Emerson         114.27   129.36    88.98    47.77   75.[?]8   61.14
12.  Hawthorn        109.93   120.41    89.99    41.85    64.14    61.15
13.  Nibley          106.76   118.84    84.65    41.37    64.29    50.15
14.  Riley           100.81   103.85    82.50    43.86    57.06    50.47
15.  Wasatch         128.59   136.49    97.73    48.38    79.54    67.54

Total possible
raw score points     147      158      125       64      100       96

80th percentile      117.6    126.4    100.0     51.2     80.0     76.8
20th percentile       29.4     31.6     25.0     12.8     20.0     19.2

linearly extrapolating the results to account for the fact that students had
three additional weeks exposure to the Title I program than they would have
had if the test been administered exactly on the empirically established norm
dates. Adjustments resulted in reducing the estimated impact (in NCE scores)
of Model A by about 8%. Tests used in both the Model A and Model B comparison
were taken from the Stanford Achievement Test (Madden, Gardner, Rudman,
Karlsen & Merwin, 1972) as shown in Table 3.



Table 3

Form and Level of the SAT Test Given for Model A
and Model B Analyses

            Selection Test(a)      Pretest                Posttest

Grade 2     Primary I, Form A      Primary I, Form B      Primary I, Form A
Grade 3     Primary II, Form A     Primary II, Form B     Primary II, Form A
Grade 4     Primary III, Form A    Primary III, Form B    Primary III, Form A

(a) The test identified in this column refers to the selection test used under
typical circumstances. As explained later in the report, a few children
were legitimately selected for Model A using different measures. Selection
test scores were unnecessary for Model B.

Analyses for the Model A evaluation were based on students selected in
two ways. First, because the school district had been using their spring
posttest as the selection test for next year's students, they had traditionally
only included those students in the analysis who had spring, fall and spring
test scores and remained in the same school from spring to spring.
Because of normal mobility within the district, this resulted in many
children being included in Title I programs in the fall who were not tested
during the spring testing, and other children who were tested in one Title I
school in the spring and transferred to another Title I school for the next
year. Although there is no guideline against limiting the analysis to those
children who have spring selection test data and stay in the same school from
one spring to the following spring, this procedure ignores the data of a
substantial number of students for whom legitimate data is available.

The second selection method included additional students who had
legitimate selection test scores even though they did not remain in the same
Title I school from spring to spring. These additional students could be
included from two groups. First, students who did not enter the Title I
school until the fall could still be included in the evaluation if they were
selected based on an objective test that was separate from the pretest.
Secondly, some students took the spring selection test in one of the
district's Title I schools and then transferred out of that school into
another Title I school in the district. In the past, these students who had
transferred within the district had not been included in the analysis.

The selection of students to be considered in the Model B comparison
schools could also be done in a number of ways which do not contradict the
guidelines provided by Tallmadge and Wood (1976). Selection Method I
recognizes that even though comparison schools are reasonably similar, one
comparison school could be somewhat higher on the average or could be
distributed differently than another comparison school. Hence the lowest 15%
in school A could have different scores from the lowest 15% in school B. In
Selection Method I, children in all of the comparison schools were combined
into one group and the percentage of children served in Title I schools
by Title I programs during the peak enrollment period (January/February) was
taken from the lower end of the test score distribution of the comparison
schools' fall test scores.

In Selection Method II, the number of students in each Title I school
who were receiving Title I services during the peak enrollment period
(January/February) was calculated as a percentage of that school's total
student body. The median percentage of students being served in each of the
eight Title I schools was taken as an average and this number was used in each
of the comparison schools to select that percentage of students from the lower
end of the distribution of the fall achievement test scores in each school.
Since all of the comparison schools are essentially similar to each other and
to the Title I schools (see Table 2), this method should provide a comparison
group which is reasonably similar to the Title I group.
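
The difference between Selection Methods I and II can be sketched as
follows. The function names, the round-down rule for the number of
students selected, and the example scores are assumptions made for
illustration; the report does not specify these details.

```python
from statistics import median


def lowest_fraction(scores, fraction):
    """Return the lowest `fraction` of scores (assumption: round down)."""
    k = int(len(scores) * fraction)
    return sorted(scores)[:k]


def method_one(schools, fraction):
    """Selection Method I: pool all comparison schools, then select."""
    pooled = [s for scores in schools.values() for s in scores]
    return lowest_fraction(pooled, fraction)


def method_two(schools, fraction):
    """Selection Method II: select the same fraction within each school."""
    return {name: lowest_fraction(scores, fraction)
            for name, scores in schools.items()}


def median_served_fraction(served, enrolled):
    """Median across Title I schools of (students served / total enrollment)."""
    return median(served[name] / enrolled[name] for name in served)
```

With two hypothetical comparison schools whose fall scores are
10, 20, 30, 40 and 50, 60, 70, 80, taking the lowest half under Method I
selects only students from the first school, while Method II selects the
two lowest scorers from each school. The two methods therefore can yield
quite different comparison groups from the same data.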

Selection Method III was similar to Method I in that children were
selected from the lower end of the test score distribution after the
comparison schools had been pooled into one group. However, this occurred in
two stages. The same percentage of children was taken from the group of
comparison schools that was taken from the Title I schools based on spring
testing data. Natural attrition of students occurred between spring and fall
in the comparison schools as it did in the Title I schools. A new group of
children was selected from the lower end of the fall test score distribution in
the comparison schools and added to the comparison group using the same
percentage that was added to the Title I schools based on fall data.

Although Selection Method III for the comparison schools is clearly the
most nearly like what happened in Model A and consequently is best for
comparing the results of Models A and B, it is not likely to be used if Model
B were being implemented by the district. Selection Method I would be the
most plausible one for a district to implement. Moreover, Selection Method
II, although not nearly as defensible empirically, is technically in agreement
with the guidelines suggested by Tallmadge and Wood (1976). It is important
to emphasize that Models A and B can be implemented correctly using very
different groups of children as a basis for making the comparison about
whether Title I programs are having any impact.



Results and Discussion 



Shown in Table 4 are the NCE growth estimates for grades 2, 3, and 4
using Model A with only those children who were selected during spring testing
and did not transfer to another school within the district (A), those children
who were selected during spring testing plus those children who were
legitimately selected during the fall or who were selected during the spring
and then transferred to another Title I school within the district (A'), and
Selection Method III in Model B, which is probably the most rigorous and most
similar to the way that children were selected for Model A. NCE growth
estimates in both versions of the Model A results have been adjusted using a
linear extrapolation to account for the fact that students were exposed to
33 weeks of instruction between the pre- and posttests rather than the 30
weeks of instruction that would have resulted had the test been given exactly
on the empirically established norm dates. Dependent variables in all cases
are subtests of the Stanford Achievement Test (1973 version). Average reading
and math scores are unweighted arithmetic averages of the individual subtests
which were administered at that grade level. Blanks in the table indicate
that that particular subtest is not included in the level of test administered
to that grade. For example, reading comprehension is not included in the
level and form of the test administered to children in second grade.
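The report does not give the formula behind the linear extrapolation. A minimal sketch, assuming the adjustment simply rescales an observed gain in proportion to weeks of instruction (30 norm weeks out of 33 observed weeks), might look like the following. The function name and the example gain are hypothetical.

```python
def adjust_gain_to_norm_interval(observed_gain, weeks_observed=33.0, weeks_norm=30.0):
    """Rescale an observed NCE gain to the norm-date interval.

    Assumption (not stated in the report): growth is treated as linear in
    weeks of instruction, so the gain is scaled by weeks_norm / weeks_observed.
    """
    if weeks_observed <= 0:
        raise ValueError("weeks_observed must be positive")
    return observed_gain * weeks_norm / weeks_observed


if __name__ == "__main__":
    # Hypothetical observed gain of 11 NCEs measured over 33 weeks.
    print(adjust_gain_to_norm_interval(11.0))
```

Under this assumption every Model A gain shrinks by the same factor of 30/33, so the adjustment changes the size of the estimates but not their ordering.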



Table 4

Title I Program Impact in NCE Gains for the
Same Program Using Different TIERS
Models and Selection Methods

                               Grade 2                 Grade 3                 Grade 4
                              A      A'       B       A      A'       B       A      A'       B

Reading Part A              9.7    10.8    -1.8       -       -       -       -       -       -
Reading Part B              6.2     6.5    -8.4       -       -       -       -       -       -
Word Study                  5.9     3.8     -.5     3.1     4.0    -3.5     -.4     -.5    -4.4
Reading Comprehension         -       -       -     2.7     3.5    -1.7     7.7     7.7     5.2
Average Reading             7.3     7.0    -3.6     2.9     3.8    -2.6     3.7     3.6      .4
                         (n=50)  (n=76)  (n=84)  (n=67) (n=100) (n=103) (n=115) (n=146) (n=156)

Math Concepts                .7     2.6  -12.8a     5.2     5.6     1.0     4.4     4.0     -.6
Math Computation            3.5     6.7  -14.1a     7.5     7.1    -2.9     8.9     7.9    -3.6
Math Applications             -       -       -     4.1     2.3    -2.3     5.1     5.2     2.8
Average Math                2.1     4.7  -13.5a     5.6     5.0    -1.4     6.1     5.7     -.5
                         (n=40)  (n=70)  (n=75)  (n=63)  (n=89)  (n=96) (n=110) (n=140) (n=144)

NOTE: All numbers in parentheses refer to the number of students in Title I programs for whom data were
available for that particular grade and evaluation model.

a Data for Model B on 2nd grade math should be viewed very skeptically because so few scores were available
and pretest scores for available students were much lower than pretest scores in Title I schools. This
was the only grade and test area where this occurred.

As can be seen from these results, Model A consistently yielded higher
estimates of program impact than did Model B. Average differences for
reading and/or math range from a low of 3.3 NCEs to a high of 10.9 NCEs (this is
discounting the one difference of 15.6 NCEs on the average math for second
grade students, since scores for these subtests were based on very few control
group students and had a number of anomalies that make the data questionable).

Using the results from Model B (which is theoretically the more rigorous
model), it appears that the Title I program had no positive impact over and
above what students would have achieved in the regular school program. Using
the results of Model A, it appears that Title I is having a substantial
positive impact.
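The range quoted above can be checked directly from the average rows of Table 4. The sketch below is only an arithmetic illustration: the dictionary layout and function name are hypothetical, the values are the Model A and Model B averages as read from Table 4, and the Grade 4 Model B math average is taken as -.5 (the mean of the three Grade 4 Model B math subtests) on the assumption that this is the printed value.

```python
# (Model A average, Model B average) in NCEs, as read from Table 4.
TABLE_4_AVERAGES = {
    ("Grade 2", "Reading"): (7.3, -3.6),
    ("Grade 2", "Math"): (2.1, -13.5),   # flagged as questionable in the table footnote
    ("Grade 3", "Reading"): (2.9, -2.6),
    ("Grade 3", "Math"): (5.6, -1.4),
    ("Grade 4", "Reading"): (3.7, 0.4),
    ("Grade 4", "Math"): (6.1, -0.5),    # assumed value, see lead-in
}


def model_a_minus_b(averages):
    """Return the Model A minus Model B difference for each grade and area."""
    return {key: round(a - b, 1) for key, (a, b) in averages.items()}


if __name__ == "__main__":
    diffs = model_a_minus_b(TABLE_4_AVERAGES)
    for key, diff in diffs.items():
        print(key, diff)
    # Range after setting aside the questionable Grade 2 math comparison.
    kept = [d for k, d in diffs.items() if k != ("Grade 2", "Math")]
    print(min(kept), max(kept))
```

Every difference is positive, which is the sense in which Model A "consistently" yields the higher estimate.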

Table 5 shows the estimated impacts of the Title I program using the
three different selection methods for Model B described earlier. Depending
on the selection method, very different children could be included in the
comparison sample. Not only do the three methods differ in the children
who are selected, but attrition in the three groups due to mobility or lack of
test scores was probably systematically different in unknown ways from group
to group, so that the actual comparison groups can be very different even though
they are each selected in accordance with the guidelines.

The results in Table 5 show the average NCE gain on each subtest at each
grade level using each of the three selection methods. Table 6 presents the
same information in a different form. For each subtest at each grade level,
the low estimate of Title I impact was set equal to 0 and the numbers for the
other methods represent the difference between that method and the method
having the low impact. Averages at the bottom of the table are an arithmetic
average of each of the cell entries. The overall average to the right
indicates the average across all cells for each method.
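The transformation from Table 5 to Table 6 described above can be sketched as follows, using the Grade 2 entries of Table 5. The function name and data layout are hypothetical, and rounding half up to one decimal place is an assumption about how the printed differences were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Grade 2 estimates from Table 5 for Selection Methods I, II, and III.
GRADE_2_TABLE_5 = {
    "Reading Part A": ("-2.1", "0.15", "-1.76"),
    "Reading Part B": ("-9.6", "-6.6", "-8.4"),
    "Reading Word Study": ("-1.7", "-1.2", "-0.47"),
    "Math Concepts": ("-14.8", "-15.0", "-12.8"),
    "Math Computation": ("-15.4", "-10.9", "-14.1"),
}


def differences_from_low(estimates):
    """Set the lowest estimate to 0 and express the others as differences from it."""
    values = [Decimal(e) for e in estimates]
    low = min(values)
    tenth = Decimal("0.1")
    return [(v - low).quantize(tenth, rounding=ROUND_HALF_UP) for v in values]


if __name__ == "__main__":
    for subtest, estimates in GRADE_2_TABLE_5.items():
        print(subtest, [str(d) for d in differences_from_low(estimates)])
```

Decimal arithmetic is used here so that a difference such as 2.25 rounds to 2.3 rather than being affected by binary floating-point representation.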



Table 5

Impact of Title I Program Using Three Different
Selection Methods for Model B with Principal
Axis Adjustment

                               Grade 2                 Grade 3                 Grade 4
SAT Subtest                   I      II     III       I      II     III       I      II     III

Reading Part A             -2.1     .15   -1.76       -       -       -       -       -       -
                           (82)    (82)    (82)
Reading Part B             -9.6    -6.6    -8.4       -       -       -       -       -       -
                           (82)    (82)    (82)
Reading Word Study         -1.7    -1.2    -.47    -2.9    -2.4    -3.5    -5.1    -3.6    -4.4
                           (84)    (84)    (84)   (103)   (103)   (103)   (156)   (156)   (156)
Reading Comprehension         -       -       -    -1.1    -1.1    -1.7     .34     .37     5.2
                                                  (103)   (103)   (103)   (156)   (156)   (156)
Math Concepts             -14.8   -15.0   -12.8    -3.1    -.54     1.0      .7    -1.7     -.6
                           (75)    (75)    (75)    (96)    (90)    (90)   (144)   (144)   (144)
Math Computation          -15.4   -10.9   -14.1    -5.3            -2.9    -4.6    -2.4    -3.6
                           (75)    (75)    (75)    (95)            (95)   (143)   (143)   (143)
Math Applications             -       -       -    -7.0    -2.5    -2.3     3.0     2.0     2.8
                                                   (89)    (89)    (89)   (134)   (134)   (134)

Note: Selection Method I took median percentage of children served in each Title I school from the group of
      children in all comparison schools based on fall scores.

      Selection Method II took median percentage of children served in each Title I school from each comparison
      school based on fall scores.

      Selection Method III two-stage selection from group of children in comparison schools based on spring test



Table 6

Differences in Model B Results Using Three
Different Selection Methods

                               Grade 2                 Grade 3                 Grade 4
SAT Subtest                   I      II     III       I      II     III       I      II     III

Reading Part A                0     2.3      .3       -       -       -       -       -       -
Reading Part B                0     3.0     1.2       -       -       -       -       -       -
Reading Word Study            0      .5     1.2      .6     1.1       0       0     1.5      .7
Reading Comprehension         -       -       -      .6      .6       0       0      .1     4.9
Math Concepts                .2       0     2.2       0     2.6     4.1     2.4       0     1.1
Math Computation              0     4.5     1.3       0     8.6     2.4       0     2.2     1.0
Math Applications             -       -       -       0     4.5     4.7     1.0       0      .8

Average                     .04    2.06    1.58     .24    3.48    2.24     .68     .76     1.7
Number of low estimates       4       1       0       3       0       2       3       2       0

Overall average:   I = .32
                  II = 2.10
                 III = 1.84

Note: Selection Method I took median percentage of children served in each Title I school from the group of
      children in all comparison schools based on fall scores.

      Selection Method II took median percentage of children served in each Title I school from each comparison
      school based on fall scores.

      Selection Method III two-stage selection from group of children in comparison schools based on spring test

As can be seen, Method I is consistently lower than either Methods II or
III. Methods II and III yield similar results. The important fact is that
although some systematic differences may persist over time using these three
different selection methods, the attrition rates are influenced by so many
other factors, and so directly affect estimates of impact, that it is difficult to
predict the differential impact of each method.






CHAPTER IV 



DEGREE TO WHICH ASSUMPTIONS MADE BY MODEL A
ARE MET IN UTAH TITLE I EVALUATIONS



Procedures 

In addition to the comparison of Models A and B, 11 school districts
were visited by project staff to investigate the degree to which districts
utilizing Model A were violating assumptions of the model. During visits to
these districts, staff members conducted structured interviews with district
Title I directors, principals, and teachers in Title I schools to collect
information about each of the Model A assumptions noted earlier. In addition,
staff members observed Title I classrooms during the administration of the
spring testing to determine the degree to which the procedures suggested in
the publisher's manual were being followed during test administration and the
degree to which teachers and students were on or off task during the test
administration. Topics about which questions were asked during the interviews
with LEA staff members included:

1. The rationale for the particular test that was being used (both
   publisher and level);

2. Policy and practice for conducting make-up tests;

3. Policy and practice for checking the accuracy of the Title I data
   which was submitted;

4. Adherence to the policy of not using the selection test as a
   pretest;

5. The time at which tests were administered and whether or not adjustments
   were made when the testing date varied more than two weeks from the
   empirical norm date; and

6. The perceived relationship between the test content and the
   instructional emphasis in that school.

Research data was collected during on-site visits to 15 schools in 11
districts. Eleven of the 13 districts required to report TIERS data to
the State Office during the 1979-80 year were included. These 13 districts
are a purposefully selected representative sample of all 40 districts in the
state. Visits to all 13 districts were planned, but last-minute schedule
changes with the district interfered and resulted in cancellation for 2
districts. The Title I director of each school district was notified by mail
(Appendix #1) a month in advance of the on-site visit and familiarized with
the purpose and methodology of the visit. Three weeks prior to the visit, the
Title I directors were contacted by phone and given additional information as
well as an opportunity to ask questions. Two weeks prior to the visit, the
school principals were likewise informed (Appendix #2) and provided with a
memo which they were asked to send to teachers and aides who would be visited
(Appendix #3).

Each school district was visited for a day by one to three data
collectors. The data collectors arrived at the schools at 8:00 a.m. and
individually interviewed the principal and selected teachers and aides for
approximately one hour. The data collectors then visited prearranged
classrooms and unobtrusively observed the students' and teachers' testing behavior
for approximately 45 minutes. After a short break, the data collectors
administered a Format Familiarity Test to several students in an empty room
for approximately 20 minutes (see Chapter V for a more complete description of
this component of the project). The Title I teachers and aides were then
interviewed again for approximately 30 minutes and asked to fill in a
curriculum content survey. At the end of the day, the Title I director was
interviewed at the district office and informed of the day's events.

Five types of data were collected during the visit.

1. Open-ended Interview

   The purpose of the open-ended interview was to: (a) determine
   the awareness of Title I personnel to possible Title I Model A
   violations, (b) examine the roles of specific personnel in each
   district in collecting, collating and distributing TIERS data,
   (c) investigate the coherence and communication among the Title I
   personnel within a district, (d) develop an impression of the degree
   of involvement and satisfaction of the Title I personnel with the
   program and evaluation techniques, and (e) identify common problems,
   complaints and sources of possible future difficulties (see
   Appendix 4 for the interview guide sheet used).

2. Determination of On Task Behavior During Testing

   Title I testing periods were observed by one to three trained
   observers for 20-minute blocks. Five students and one teacher were
   observed in each block. On and off task behavior was recorded in
   5-second intervals, using a tape recorder and earphones as a pacer.
   The observers were trained at the university prior to the on-site
   visit using both video tapes and actual classroom observation and
   attained a minimum of .85 inter-rater reliability. The data
   collection form and definitions for student and teacher on task and
   off task behavior are included in Appendix 5. Further information on
   the procedures used for collecting these data is contained in
   Chapter VI.



3. Quality of Test Administration Checklist

   Environmental, instructional, and situational variables which
   contribute to high-quality standardized test administration were
   recorded using a dichotomous checklist. Variables included in the
   checklist were identified based on standardized test administration
   manuals and textbooks on standardized test administration. Data
   collection procedures practiced at the university prior to the on-site
   visit attained a high degree of interrater reliability (.90 or
   better). The checklist used is included in Appendix 6.

4. Format Familiarity Test (FFT)

   The FFT was administered following the Title I testing period.
   The Title I teacher or aide was asked to supply four to six average
   Title I students so that they could be administered a test to
   determine if a student's demonstrated knowledge of reading phonics
   differed depending on the format in which the test was administered.
   After a short rest period, these students were tested in an empty
   room for approximately 15 minutes. Additional information about the
   procedures and results of this activity is included in Chapter V.

5. Curriculum Content Interview

   The Title I teachers and aides were queried about the content of
   their reading curriculum and the relative emphasis and importance
   they accorded the various content areas. The interview format used
   during this discussion is shown in Appendix 7.
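The interval recording and reliability check described under item 2 can be illustrated with a short sketch. The report does not say how inter-rater reliability was computed; the interval-by-interval proportion of agreement used here is an assumption, and the function names and sample records are hypothetical.

```python
def percent_on_task(intervals):
    """Percent of recorded intervals coded on task (True = on task)."""
    if not intervals:
        raise ValueError("no intervals recorded")
    return 100.0 * sum(intervals) / len(intervals)


def interobserver_agreement(observer_a, observer_b):
    """Proportion of intervals on which two observers gave the same code.

    Assumption: reliability is simple interval-by-interval agreement; the
    report gives only the resulting criterion (.85 or better).
    """
    if len(observer_a) != len(observer_b) or not observer_a:
        raise ValueError("observers must code the same non-empty set of intervals")
    matches = sum(a == b for a, b in zip(observer_a, observer_b))
    return matches / len(observer_a)


if __name__ == "__main__":
    # Ten hypothetical 5-second intervals coded by two observers.
    observer_a = [True] * 8 + [False] * 2
    observer_b = [True] * 7 + [False] * 3
    print(percent_on_task(observer_a))
    print(interobserver_agreement(observer_a, observer_b))
```

In this made-up record the observers disagree on one interval out of ten, which would satisfy the .85 training criterion.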
In addition to these data collection activities, a variety of additional
assistance was provided to the Salt Lake City School District in implementing
TIERS. One of the major activities was assisting in developing a workable
procedure and set of definitions for computing student/teacher ratios in
Title I programs. Based on the resulting procedures (see Appendix 8), Title I
teacher leaders were interviewed, and data was collected for computing the
student/teacher ratios of their Title I programs.

The districts and schools visited, the dates of visits, and the personnel
interviewed to collect data in the five areas listed above during this
component of the project are listed in Table 7. The type and amount of data
collected from each school and grade level are listed in Table 8.

Results 

Interviews with LEA Staff

The open-ended interviews consisted of five parts: (1) students'
reaction to testing, (2) district personnel reaction to testing, (3) selection
of students, (4) test administration, and (5) procedures for submitting TIERS
data. Data was collected from three categories of personnel: (1) Title I
directors, (2) principals, and (3) Title I teachers and aides. The results
were as follows.

Student reaction to testing. The majority (55%) of the Title I directors
were not familiar with their students' reactions to testing. Of the
remainder, 50% said the students were not negative toward testing and
understood it, while 100% felt the students generally behaved well and tried
their best.

The principals were generally positive about the students' reaction to
the testing (62%) and their understanding of the purpose of the test (62%).
All of the principals felt the students generally behaved well, but only 29%
felt the students tried their best.



Table 7

Title I Field Visit Information

                 Date      Elementary
District         Visited   School Visited    Personnel Interviewed

Alpine           4/29                        Dr. Tregoskis - Title I Director
                           Greenwood         Mrs. Brannon - Title I Teacher
                           Greenwood         Mr. Crandal - Principal

Box Elder        4/28                        Mr. Harding - Title I Director
                           Lincoln           Mr. Stanger - Principal
                           Lincoln           Mrs. Hogart - Title I Teacher

Duchesne         4/2       Duchesne          Mr. Hansen - Title I Director
                           Duchesne          Mrs. Gelest - Principal
                           Duchesne          Mrs. Meldrum - Title I Teacher
                           Myton             Mr. Duke - Principal
                           Myton             Mrs. Roberts - Full Time Aide

Morgan           4/?4                        Mr. Jeffrey - Title I Director
                           Morgan            Mr. Harnal - Principal
                           Morgan            Mrs. Adamson - Title I Teacher

Murray           4/9                         Dr. Bertleson - Title I Director
                           Viewmont          Mr. Campbell - Principal
                           Viewmont          Mrs. Middleman - Title I Teacher
                           Viewmont          Mrs. Froelich - Title I Teacher

Park City        5/12                        Dr. Falls - Title I Director
                           Old School        Dr. Falls - Principal
                           Old School        Mrs. Shonon - Title I Teacher

Provo            1/17                        Dr. Brimley - Title I Director
                           Timpanogas        Mr. Gunther - Principal
                           Timpanogas        Mrs. Murdoch - Title I Teacher

Salt Lake City   5/6                         Mrs. McDonald - Title I Director
                           Parkview          Mrs. Weggeland - Principal
                           Parkview          Mrs. Crawford - Title I Teacher
                           Parkview          Mrs. Taylor - Title I Teacher
                           Washington        Dr. Comb - Principal
                           Washington        Mrs. Jackson - Title I Teacher
                           Bennion           Mrs. Dolee - Title I Teacher

S. Sanpete       5/6                         Dr. Graham - Title I Director
                                             Dr. Graham - Principal
                                             Mrs. Richardson - Title I Teacher

S. Summit        5/7                         Mrs. Link - Title I Director
                           S. Summit         Mrs. Walker - Principal
                           S. Summit         Mrs. Marchant - Full Time Aide

Wasatch                                      Mrs. Baird - Title I Director
                           North             Mr. Dayton - Principal
                           North             Mrs. Ivy - Title I Aide




Table 8

Data Collected Regarding TIERS Implementation in Eleven Utah
School Districts

Districts (rows): Alpine, Box Elder, Duchesne, Morgan, Murray, Park City,
Provo, Salt Lake, S. Sanpete, S. Summit, Wasatch.

Data recorded for each district (columns): personnel interviewed, number
of students administered FFT, number of on/off task periods, curriculum
estimate, and teacher checklist.



Only 22% of the teachers felt the students reacted positively to the
testing, while 79% felt the students understood the purpose of the test and
88% felt they generally behaved well. Only 44% felt their students tried
their best.

District personnel reactions to testing. All of the Title I directors
felt the testing was worthwhile, although some said there was too much testing.
Only 11% felt that the test results were used for anything other than to
compare gains. Thirty-seven percent of the directors did not know if test
results were discussed with parents, but all of the other directors responded
that test scores were individually discussed with the students and/or
parents.

Of the principals, 86% felt the testing was worthwhile. Only 33% said
they used the test results in any way other than to compare gains, while 83%
said the results were individually discussed with either the students or
parents.

The teachers definitely had the most negative reaction to the testing.
Only 38% of the teachers felt that the testing was worthwhile, only 29% said
they made any special use of the test scores, and only 44% said they actually
discussed the test scores with the parents of students. Many said they would,
however, if the parents ever bothered to visit or inquire.
• • ' * 

Selection of students for Title I programs. The Title I directors were
the best informed about the process involved in selecting the students. This
generally involved a screening test and a teacher evaluation. The directors,
principals, and teachers unanimously felt that the selection process resulted
in selecting the right students.

Only one district reported not keeping the selection test and pretest
different because they did not feel this was necessary. The directors,
principals and teachers could only make guesses ranging from 0% to 50% as to
the percentage of students that were new move-ins, testing out of level,
and/or transfers. Interestingly, no one knew if or how many students were
tested out of level.

The biggest discrepancy between the three personnel levels concerned how
new move-ins were selected. The answers differed more qualitatively than
quantitatively. The directors responded with the proper technique, i.e., each
new move-in was tested on one or more selection measures and evaluated by the
teacher; then the results were weighted and it was determined whether the
child met the established criteria. The principals responded with what they
thought practically happened: each new child was given a test which determined
whether they would be in Title I. The teachers explained what really
happened. Sometimes the new students were given a test, but generally after
two weeks the teachers themselves knew whether or not the student should be in
Title I and the test was frequently not given.

Test administration. Title I directors unanimously agreed that tests
were given on the empirical norm dates, that students who were absent during
testing were tested immediately upon their return, and that the students always
got make-ups. Only one district extrapolated test scores and most of the rest
were not certain as to when or why this was necessary. Again, there was
confusion as to what, how often, and why they gave out-of-level testing.

Every district gave a different reason for selecting their particular
type, form, and level of test. Some of the reasons were: easiest to
administer, familiarity, one that allows half-year testing, convenience and



Table 9

Responses of Title I Directors, Title I Teachers and
Aides, and Principals to Open-Ended Interview Questions
Which Could Be Answered Yes or No

Respondents (columns): Directors, Principals, Teachers & Aides (n=16)

Questions (rows):

 1) Do students feel positive about testing?
 2) Do students understand the purpose of the testing?
 3) Do students behave well during testing?
 4) Do students try their best during testing?
 5) Is achievement testing worthwhile?
 6) Are test scores used for purposes other than compliance with ...
 8) Are tests administered within guidelines for empirical ...
11) ... given any special preparation for test administration?
12) Are forms and procedures for submitting data to SEA good?

NOTE: Numbers in parentheses refer to the percent of people who felt they did not have enough
information or knowledge to answer.



school curriculum match, Northwest Lab's recommendation, and literature
review.

Seventy-one percent of the Title I directors thought that special effort
was made to prepare the students or teachers for testing, while only 50% of the
principals and teachers felt that the students received little or no
preparation for testing. Furthermore, only 25% of the principals and teachers
felt the teachers were given any special preparation for the testing.

Procedures for submitting data to SEA. In all instances the Title I
directors are responsible for submitting data to the USOE, and this year's
reporting format is thought to be far better than that of the previous year.
The accuracy of the testing data is supposedly ensured by random error checks
in some districts, while in most it is trusted to the teachers and only
glaring errors are noticed. None of the districts had any evidence that
random error checks were actually done.

Summary of open-ended interviews. The results from the open-ended
interviews are summarized in Tables 9 and 10. Table 9 shows the percentage of
people in each category who felt they had enough information to respond who
answered each question affirmatively (the numbers in parentheses show the
percentage of people who said they did not know enough to answer). Many of
these questions are directly related to the validity of standardized testing
in general and specifically the implementation of the TIERS models.

Although these data indicate that there may be some minor to moderate
violations of TIERS assumptions regarding Model A, the lack of agreement among
LEA personnel on many of the questions is also disturbing. In addition to
questions which could be answered "yes" or "no," a number of other questions
were asked regarding procedures and opinions regarding TIERS implementation.
The most disturbing fact about this information was the lack of agreement
among various LEA staff members about what was really taking place. Table 10



Table 10

Agreement Among School Personnel on
Open-Ended Interview Questions

                                                                 D/P*         D/T*         P/T*
Question                                                       A**  DK**    A**  DK**    A**  DK**

 1) How are the students selected for Title I programs?         42   12      25   75      66   25
 2) What percent of students are tested out of level?            0   75       0   75     100   87
 3) How are new move-ins selected?                              25   12      40   50      63   62
 4) What is the percentage of new move-ins in the district?     66   62       0   75      50   50
 5) What is the percentage of turnover in Title I programs?      0   87       -  100       0   87
 6) What percent of the students never get a make-up test?       -  100       -  100       -  100
 7) Who is responsible for turning data in to USOE?            100   75       0   87       0   76
 8) How did you select your particular achievement test?       100   87       0  100       -  100
 9) What percent of students miss testing?                       -  100       -  100       -  100
10) How is accuracy of data checked?                           100   87       0   87       0   75

    AVERAGE AGREEMENT                                           54%           9%          40%

 *D/P = Percent agreement between Directors and Principals
 *D/T = Percent agreement between Directors and Teachers
 *P/T = Percent agreement between Principals and Teachers

**A  = Agree
**DK = Don't know or not obtained

shows average agreement between directors and principals, directors and
teachers, and principals and teachers. Numbers in each cell represent the
percentage who were in agreement, while the numbers in the DK columns indicate
the percentage of time one or both of the pair did not know enough to respond.
The low levels of agreement among LEA staff on these items make definitive
conclusions about what is really happening difficult. The preceding narrative
about each of the areas discussed in the interviews represents the project
staff's best estimate and indicates some moderate problems.

The low levels of knowledge or agreement among LEA staff raised
additional questions. For example, only 13% knew if test scores were
extrapolated (question #9, Table 9), and only 21% knew how many students were
tested out of level (question #2, Table 10). Only 42% of the school personnel
agreed as to how new move-ins were selected (question #3, Table 10). Overall,
Title I directors and principals had the greatest agreement about the testing
procedures (54%, perhaps because they are more familiar with what the
procedures should be and thus answered "correctly"). The Title I directors
and teachers were least in agreement (9%, perhaps because the teachers
responded with what actually happened at the school).
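The 54% figure for directors and principals can be reproduced from the D/P agreement column of Table 10. The sketch below assumes that questions with no agreement figure (those where no pair knew enough to respond) are left out of the average; the report does not state this rule, and the function name is hypothetical.

```python
# D/P agreement percentages for questions 1-10 of Table 10.
# None marks a question with no agreement figure reported.
DP_AGREEMENT = [42, 0, 25, 66, 0, None, 100, 100, None, 100]


def mean_agreement(values):
    """Average the reported agreement percentages, skipping missing entries."""
    reported = [v for v in values if v is not None]
    if not reported:
        raise ValueError("no agreement figures reported")
    return sum(reported) / len(reported)


if __name__ == "__main__":
    print(round(mean_agreement(DP_AGREEMENT)))
```

Treating the missing entries as zeros instead would lower the average to about 43%, so the exclusion rule matters when comparing the three pairings.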


Determination of On-Task Behavior During Testing

Table 11 shows the average percent of time that teachers and students
were on task during test administration. "On task" was defined for both
teachers and students according to discrete, observable behavior. For
example, movement for a teacher was "on task" if she was standing in front of
the room or pointing at nonattenders or providing a pencil, but was "off task"
if she was standing with her back to the students or where students' faces
couldn't be seen, or sitting down at her desk correcting papers during the



Table 11

Percent of Time Students and Teachers Are
On Task During Test Administration

                      Students           Teachers

Total                 76.5 (n = 32)      52.5 (n = 12)
Grades 1 and 2        75.3 (n = 15)      70.0 (n = 5)
Grades 3 and 4        75.3 (n = 12)      41.7 (n = 5)
Grades 5 and 6        82.7 (n = 5)       35.5 (n = 2)



Table 12

Percent of Time Students and Teachers
Are Having Contact During
Test Administration

                              Mean Based on          Mean Based on
                              Student Observation    Teacher Observation

Total (n = 32)                       3.5                    5.9
1st & 2nd Grade (n = 15)             5.1                    8.9
3rd & 4th Grade (n = 12)             1.7                    5.7
5th & 6th Grade (n = 5)              2.9                    0

test. On-task and off-task behaviors were defined based on suggestions in
publishers' test administration manuals and textbooks on standardized group
achievement test administration.

Students were on task about 75% of the time and teachers were on task
only about 52% of the time. The percent of on-task behavior increases
slightly for students in the higher grades and decreases for teachers in the
higher grades.
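The "Total" row of Table 11 is consistent with an average of the grade-band figures weighted by the number of observations in each band. The report does not say how the totals were computed, so the weighting below is an assumption, and the function name is hypothetical.

```python
# (percent on task, number of observations) for each grade band in Table 11.
STUDENTS = [(75.3, 15), (75.3, 12), (82.7, 5)]
TEACHERS = [(70.0, 5), (41.7, 5), (35.5, 2)]


def weighted_mean(groups):
    """Average the group percentages, weighting each by its number of observations."""
    total_n = sum(n for _, n in groups)
    if total_n == 0:
        raise ValueError("no observations")
    return sum(pct * n for pct, n in groups) / total_n


if __name__ == "__main__":
    print(round(weighted_mean(STUDENTS), 1))
    print(round(weighted_mean(TEACHERS), 1))
```

An unweighted average of the three teacher figures would be about 49.1 rather than 52.5, because the two teachers observed in grades 5 and 6 would then count as heavily as the five observed in each of the lower bands.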



Quality of Test Administration

Table 12 shows the mean percentage of time which students and teachers
have contact during the standardized test administration. The discrepancy
between the estimate based on student observation and teacher observation is
because only five Title I students out of the total class were observed, but
during the teacher observation contact with any student was counted.
Table 13 shows the results of a dichotomous designation of various activities
which teachers should be doing before and during test administration. As can
be seen, many of these items are done infrequently, with some of the most
important items (e.g., under student preparation see #4, "Explain the
reason for the test.") being done least frequently. Items on the checklist
were taken from publishers' recommendations and standard textbooks on test
administration.

Match Between Curriculum Emphasis and Testing

Table 14 shows the percent of time Title I teachers report they spend in
class teaching five areas of reading skills and the importance they place on
these skills. Table 15 shows the emphasis that the reading subtests of six
standardized achievement tests place on the same five skill areas, as
determined by the percent of questions directed toward sampling a student's
skill in each of the areas. Table 16 shows the differences in emphasis on



Table 13

Degree to Which Recommended Practices in Standardized
Test Administration Are Followed During Testing

DID THE TEACHER DO THESE BEFORE
ADMINISTERING THE TEST?

Class Environment                                            n = 38

Arrange the students' desks so they are not touching.

Position the desks to face the same direction (every
student's face can be seen from the front of the room).



68X 


1. 


, 58X 


2. 


78X 


3. 


56X 


4. 


84% 


5, 


94X 


6. 


69X 


7. 


40* 


8. 



Post a "Testing, Do Not Disturb" sign on door.

Have a visible supply of pencils.

Have a visible clock or watch with a minute hand.

Create a generally positive climate that promotes good work habits and
is without pressure or tension.



Student Preparation

____ 1. Provide an opportunity for using the bathroom, drinking water, and
        sharpening pencil.

97%  2. Provide all students with a pencil and an eraser.

59%  3. Ask students to remove nontesting material from desks if appropriate.

____ 4. Explain the reason for the test (to use the information to help teach
        students).

____ 5. Obtain the attention of the entire class for 1 minute prior to
        directions (all students watching teacher).

____ 6. Pass out test booklets in less than 2 minutes and in an orderly and
        efficient manner.

____ 7. Verbally reward attentive behavior.

Reminders

10%  1. Not to leave their seat but to raise a hand if something is needed.

47%  2. What to do if they finish before time is up (11 only).

27%  3. To check their work if they finish before the time is up (to see if
        every question is answered only once).

____ 4. That some of the items will be more difficult than their daily work.

10%  5. To skip an item that they don't know and go on to the next one.




Table 13 (continued)

Did the teacher do these WHILE administering the test?

Positive Atmosphere

____ 1. Praise individual students for appropriate behavior.

____ 2. Praise class for listening and working.

____ 3. Smile frequently.

89%  4. Make less than two reprimands, threats, or criticisms during the subtest.

95%  5. Speak with a gentle, but firm voice.

75%  6. Use physical touch to prompt and reward on-task behavior.

____ 7. Start the test directions within several minutes of sitting down so that
        students did not become restless with preparation activities.

____ 8. Quickly supply a student with pencil or eraser when needed.

58%  9. Stand near front of room where all students can see easily.

Reading Directions

... class between sentences.

... the class to check if directions were followed (i.e., "Put your finger
    ... Sample," "fill in the circle," "write your name," "turn to page 12").

... aide to nonattenders.

... next direction only after all students are ready.

...nt printed directions with verbal and visual explanations when
    ... do not understand the procedure.

Change wording of directions to a vocabulary the students are familiar with
    (i.e., "circle" instead of "oval" or "box" instead of "frame").

Reading Test Items

____ 1. Look up after each question and glance around room.

____ 2. Follow the exact wording of questions as stated in the manual. (Never
        define or explain words or illustrate procedures.)

67%  3. Allow approximately 10 seconds between items.

50%  4. Never repeat a question unless the directions specify to do so.

____ 5. Alert aide to nonattenders or to students with raised hands.

Timed Tests

____ 1. Set clock for correct time requirement.

72%  2. Watch students during entire test to detect speeding, slow answering,
        day dreaming and cheating.

50%  3. Alert aide to nonattenders or to students with raised hands.

End of Test

53%  1. Praise students for working hard.

84%  2. Collect booklets in a directed manner.

56%  3. Provide a directed, stand-up rest period.







Table 14

TEACHER RATINGS OF THE IMPORTANCE OF VARIOUS CONTENT AREAS ASSESSED BY STANDARDIZED READING TESTS

Columns, in order:
  District (N)
  Test Used
  % Time Spent on Each Area: Phonics, Vocabulary, Literal Comprehension,
    Inferential Comprehension, Structural Analysis
  Importance of Each Area (1 = high; 5 = low): Phonics, Vocabulary,
    Literal Comprehension, Inferential Comprehension, Structural Analysis


1 - S. frvijt 42) 
«| (2nd & 3r«T grade) 


ITBS 


-25 


30 

— ■- * 


13 ^ 


4 13 


18 




2 




5 ; 


3 


2 - S. Sanpete^ (1) 

(2nd nrnde) 


G-M" 


10 


10 


35 


35 


10 




1 


4 


5 


3 


At 

3 - S. Sanpete . (1) 
(3r<f gradaX 


. G-M 


5 


10 


35 


40 


10 


5 


3, 


, 2 


1 

1 


' 4 


4 - Provo * (1) 
(2nd 4 3rd trade) 


SAT 




20 


20 


0 


20 > 


' 1 


2 


-7* 


5 


4 


5 - -Morgan (1) 
(lat grade) 


f # * 
6-M . 


60 


20 

I — *- 


7 


3 


10 




2 


4 


5 


3 / 


to 

6 - Duchesne * (2) 
(3rd 4 4th urada) 


'.CAT 


30 


. 30 


13 


7 


" 20 ' 


1 


2 


• 3 


5 


' .4 


7 - Box Elder (2) 
(2nd 4 3r<T trad*) 


ITBS 


^ 35 


25 . 


25 


• 5 






2 


3 


5 . 


4 


, 8 - Alpine (4) 
(2nd grade) 


SAT 


15 


30 


10 


5 


40" 




3 


4 


5 


• 2 


9 - Murray -fl) 
(2nd grade) 


Wpodcock 



90 


; -10 


0 


0 


0 


* 1 


. - 2 


3 






10 - Murray (1) 
(4th greafe) 


v. 

Woodcock 


50 


50 


0 


0 


0 


A 


2 




5 




11 - Salt Lake City (3) 
* (2nd grada> 


SAT 


37 


20* 


17 


8. 


16 




2 




4 




12 - Salt Lake City (2) 
(3rd grade) ' 


SAT 


20 


25 . 


20 


10 


25 




2 




4 




13 - Salt Lake City (3) > 
(4th grade) 


SAT 


15 > 


16 




20 


16 




3 




2 




TOTAL 




.33 


23 


17 


10 


•If 

— * — 1 


I* 


2.2 


3.3 


4.4 


3.4 






Table 15

Percent of Items Devoted to Various Areas
of Reading for Six Standardized Tests

2nd GRADE
                                       CAT   SAT   WOODCOCK   ITBS   GATES-M   SRA

Vocabulary                              21    20      20       20      80      28
Comprehension (both literal and
  inferential)                          29    47      24       46       0      28
Phonics                                 35    33      56       34      20      22
Structural Analysis                     15     0       0        0       0       0

3rd GRADE
                                       CAT   SAT   WOODCOCK   ITBS   GATES-M   SRA

Vocabulary                              21    19      20       41      51      54
Comprehension (both literal and
  inferential)                          37    48      24       59      49      46
Phonics                                 27    33      56        0       0       0
Structural Analysis                     15     0       0        0       0       0




Table 16

Differences in Emphasis of Skills Taught by Teachers and
Sampled by Standardized Tests for Grades 2 and 3

                        Differences in Percent of Time Devoted    Average        Differences in Importance
District                Phonics   Vocab.   Compre.   S.A.*        Discrepancy    Phonics   Vocab.   Compre.   S.A.*

S. Summit (2nd)            -9      +10      -20      +18            14.3           +1       +1       -1       +1
S. Summit (3rd)           +25      -11      -33      +18            21.8                     0       -3       +1
S. Sanpete (2nd)          -10      -70      +70      +10            40.0            0        0        0       +1
S. Sanpete (3rd)           +5      -41      +26      +10            20.5            0       -1       +1
Provo (2nd)                +7        0      -27      +20            13.5           +1       +1       -2        0
Provo (3rd)                +7       +1      -28      +20            14.0                    +1       -2
Duchesne (3rd)             +3       +9      -17       +5             8.5            0       +1       -1        0
Box Elder (2nd)            +1       +5      -16      +10             8.0           +1       +1                 0
Box Elder (3rd)           +35      -16      -29      +10            22.5           +3        0       -2        0
Alpine (2nd)              -18      +10      -32      +40            25.5           +1        0       -3
Murray (2nd)              +34      -10      -24        0            17.0            0       +1       -1
Salt Lake City (2nd)       +4        0      -22      +18            11.0            0       +1       -3        0
Salt Lake City (3rd)      -13       +6      -18      +25            15.5           -2       +1       -1        0



*Structural Analysis

The percent difference scores are computed by subtracting the percentage of test items devoted to sampling
a reading skill area (Table 15) from the percent of time the district spends teaching these skills (Table 14).
Hence, a positive score means that a district is placing greater emphasis on teaching a given skill area than
is sampled by the test, as indicated by the number of items on the test pertaining to that skill area.

The differences of importance scores are computed by subtracting the rank ordering of perceived importance
of skill areas by Title I teachers (Table 14) from the rank ordering of importance of the skill areas in the
standardized tests as determined by the number of test items sampling each skill area. Hence, a positive
score means the district thinks a certain skill area is more important than indicated by the test.

The average difference between instructional emphasis and test emphasis was not used to compare districts.
The greater the number in this column, the more discrepant is the district's instructional emphasis from the emphasis
suggested by the number of test items.
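The percent difference scores described in these notes can be recomputed from Tables 14 and 15. The sketch below uses the S. Summit (2nd grade, ITBS) figures and assumes two things the notes do not spell out: that literal and inferential comprehension are summed before comparison, and that the Average Discrepancy column is the mean of the absolute differences. Function and variable names are illustrative.

```python
# Percent of instructional time reported by S. Summit teachers (Table 14).
# Assumption: literal (13) and inferential (13) comprehension are combined.
time_taught = {
    "phonics": 25,
    "vocabulary": 30,
    "comprehension": 13 + 13,
    "structural_analysis": 18,
}

# Percent of ITBS 2nd grade items devoted to each area (Table 15).
test_items = {
    "phonics": 34,
    "vocabulary": 20,
    "comprehension": 46,
    "structural_analysis": 0,
}


def percent_differences(taught, tested):
    """Time taught minus test emphasis; positive means taught more than tested."""
    return {skill: taught[skill] - tested[skill] for skill in taught}


def average_discrepancy(differences):
    """Assumed definition: mean of the absolute percent differences."""
    return sum(abs(d) for d in differences.values()) / len(differences)


diffs = percent_differences(time_taught, test_items)
avg = average_discrepancy(diffs)
print(diffs)  # phonics -9, vocabulary +10, comprehension -20, structural analysis +18
print(avg)    # 14.25, which Table 16 reports as 14.3
```

The same arithmetic applied to the other districts agrees with most rows of Table 16, which is some support for the two assumptions.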






skills taught by teachers and sampled by standardized tests. Although the
techniques used to collect these data are admittedly somewhat crude and based
on limited numbers of teachers in each district, the results nonetheless
suggest that some schools are using tests which are completely inappropriate.
For example, the third grade teacher in South Sanpete spends 73% of her time
on comprehension and ranks it as the most important of the reading skills.
Yet the test used by the district does not include any items designed to
measure reading comprehension.

Discussion

The results of the data collected on the field visits indicate that
districts can do much to improve the preparation for and implementation of
standardized achievement tests used in conjunction with TIERS. Of greatest
concern is the apparent lack of awareness of proper testing procedures among
school personnel. Even in those districts in which programs appear to be
properly implemented, the lack of agreement and/or lack of knowledge among
personnel as to what the program is actually doing raises substantial
concern.

The second area of concern is the appropriateness of the standardized
achievement tests selected by each school district. Only one school district
chose a test because of data suggesting its suitability. Most districts
selected tests because of price, convenience, or habit. If all achievement
tests were equally suitable, these would be excellent determinants of test
selection. However, there is strong evidence that the differences in content
sampled by the tests and the differences in the format in which the tests are
administered (see Chapter V) produce significant differences in scores that
cannot be accounted for by normative scaling. In some school districts and
grades, there is a wide diversity between what is taught in the classroom and
what is sampled by the achievement test used in that district and grade.

A final concern is the performance of the individual teachers before,
during and after testing. In many districts opportunities were not afforded
for the teachers to be adequately prepared or familiarized with the test and
testing procedures. The data collected during the visits to LEAs suggest that
standardized testing procedures are frequently violated and both teachers and
students demonstrate fairly high levels of off-task behavior. In summary,
awareness of TIERS guidelines needs to be improved, better communication is
needed between Title I directors, Title I teachers and principals, better
criteria for selecting tests should be adopted, and efforts are needed to
assure that standardized test administration procedures are used.






CHAPTER V

THE EFFECT OF ITEM FORMAT ON STUDENTS' STANDARDIZED
READING ACHIEVEMENT TEST SCORES

Cronbach (1971) defined the universe of behavior sampled by a test as
including all abilities required to perceive the stimulus items and formulate
responses. For an achievement test to be a valid indicator of what a student
knows, it must be assured that the student gets answers correct or incorrect
because of what he or she knows. If variables besides the student's knowledge
of content affect the student's score, correct interpretation of the results
becomes more difficult, and one must question the validity of the test for
making decisions that depend on the student's knowledge (e.g., educational
placement and instructional programming decisions).

During the preparation for the on-site visits to various districts
described in Chapter IV, it became clear that the format in which a
standardized test question was asked was one such factor that might influence
test scores and confound what a student appears to know. Consequently, a
component was added to the work scope of the project to investigate the effect
of item format on students' standardized reading achievement test scores.
This chapter describes the rationale for the study, the procedures and
results, and the implications for the TIERS. In addition to the original
study, the results of a follow-up replication of the study with Title I
students in Texas are included. Although the follow-up study was not
conducted with project funds, the procedures for the study were developed
during the project and the results of the second study underscore the
importance of format differences in interpreting test scores.




Previous Research

Previous research has investigated a number of factors, besides what the
student knows, which may affect how well a student does on standardized
achievement tests. The research summarized briefly below considered variables
inherent in the tests themselves and the way questions were asked, as opposed
to student characteristics (e.g., locus of control, motivation, anxiety) or
setting variables (e.g., time of day, student/examiner ethnicity match).

Order of Test Items

Whether test items are presented from easiest to most difficult, most
difficult to easiest, or in scrambled order may affect the students' test
scores. Barcikowski and Olsen (1975) reported that students feel tests are
harder when the test items are presented in a decreasing order of difficulty.
Towle and Merrill (1975) showed that the order of presentation can affect the
anxiety level of the test and thus influence performance. Some researchers
have reported that increasing item difficulty elicits better performance
(Holliday & Partridge, 1979), while others have found no effect at all (Gerow,
1980; Dambrot, 1980).

Test Item Format

Millman and Setijadi (1966) showed that Indonesian students perform
better than American students on open-ended math problems, while American
students perform better on multiple-choice problems. Evidently, the format in
which questions were asked was a determinant of students' scores.

In some instances the format of a test item decreases test scores by
demanding additional intellectual skills to answer the question. Poage and
Poage (1977) found that math questions containing pictures were missed more
often than comparable questions without pictures on the Michigan State
Assessment Test. They hypothesized that items with pictures resulted in lower
scores because they required cognitive capacities involving the use of
perspective that were not fully developed in younger children. In a related
study, Kierscht and Vietze (1975) investigated whether younger children
perform better on questions that use three dimensional objects as the test
stimulus rather than two dimensional pictures. Kierscht and Vietze (1975)
found that preschool children performed better on the Slosson Intelligence
Test, which uses objects, than on the Peabody Picture Vocabulary Test, which
uses pictures.

Research has also demonstrated that students perform better on multiple
choice than on short answer question formats covering the same content
(Kumar, Rabinsky, & Pandley, 1979; Loftus, 1971). Estes and DePolito (1967)
suggested that such differences occurred because the short answer format
required the student to recall the information from memory, whereas the
multiple choice format only required the student to recognize the correct
answer. Kintsch (1970) similarly suggested that recall items required the
student to both search for and retrieve information, whereas recognition items
only required the student to discriminate between the presented information.
Halpin and Halpin (1979) demonstrated that students performed best when the
type of test content was matched with the type of item format (i.e., when
recognition item formats sampled concept level questions, and recall item
formats sampled knowledge level questions).

Frisbie (1973) hypothesized that "multiple choice questions limit the
universe of comparisons" because they only require an individual to recognize
the correct response. When answering a true-false question, however, an
individual must eliminate the incorrect response by forming a counter example
from a wide universe of possibilities. Benson and Crocker (1979)
administered multiple-choice, true-false and matching type questions of
identical content to students and also found a statistically significant
difference in the scores for each item format. Students scored highest on
the matching questions and worst on true-false questions.

• • . . . \ 

Question and Answer Form

All test items require that a question be asked and an answer be given.
The way in which questions are asked or answers given varies from test to
test. For example, Table 17 displays four different question and answer
formats from the word analysis subtests of three second grade standardized
achievement tests. Differences in the form for either questions or answers
may affect the scores students receive on standardized tests.

Johnson, Pittleman, Schwenker and Perry (1978) administered the
vocabulary subtest of the Metropolitan Achievement Test with three different
types of question forms. The question forms were synonym, synonym in
context, and cloze. The answer form in all cases was multiple choice. The
students performed best on the synonym format, in which they were presented a
stimulus word and asked to find the response alternative which was closest
in meaning to the word. The synonym in context format was of intermediate
difficulty. In this form, the stimulus word was embedded in a sentence.
Students scored lowest on the cloze format, which is a "fill in the blank"
format. In a nonempirical article, Roid and Haladyna (1978) concluded that
multiple-choice item stems sampled content best when the item stems had a
fixed syntactical structure (as found in synonym in context formats) and
when adjectives and verbs were used rather than other parts of speech as the
target words in cloze tests.






Table 17

Sample Test Item Formats Taken From the Phonics Analysis
Subtests for Three Standardized Achievement Tests

FORMAT #1: Stanford Achievement Test, Level II, Form A.

           "Mark under the word CAT."

           CAR        CAN        CAT
            o          o          o

FORMAT #2: California Achievement Test, Level 12, Form C.

           "Find the word with the same beginning sound as the
           word CAT."

           CAN        TAN        TACK

FORMAT #3: Stanford Achievement Test, Level II, Form A.

           "Mark the space under the word that has the same sound
           as the underlined sound."          CAT

           CAR        SAT        NAP

FORMAT #4: Woodcock Reading Mastery Test.

           "How does this word sound?"        CAS






Changes in the answer form may also affect the child's score on an
achievement test. Grier (1975), Weber (1977) and Catts (1978) found that
additional distractors decreased what the child appeared to know. Williams,
Davis, Anderson and Favor (1978) and Lai-Min and Coffman (1974) administered
the original and a modified version of the Iowa Test of Basic Skills to
elementary school students. Williams et al. found that an additional "all
incorrect" distractor decreased scores, and suggested this decrease was
attributable to students' unfamiliarity with that format. Lai-Min and Coffman
added an "I don't know" distractor and also found that scores decreased.
Weiten (1979) administered multiple-choice questions with either compound or
single response forms and found that students scored lower on the compound
response form.

Summary

Achievement tests are usually interpreted as an indication of what
children know and are frequently used to make important decisions concerning
the placement, advancement, and educational programming of children. If, as
previous research has indicated, types of test item formats influence what a
child appears to know, the types of formats used by an achievement test should
be considered in selecting tests and interpreting achievement test scores.

Based on previous research, it appears that how the test items are
organized, the way in which test items are asked, and the method by which they
are answered all affect students' performance on tests. The research
described herein expands the findings of previous research by comparing the
effects on students' scores of four test item formats taken from three
commonly used standardized reading achievement tests.






Method

Test Construction

The Stanford Achievement Test (Level II, Form A), the California
Achievement Test (Level 12, Form C), and the Woodcock Reading Mastery Test were
analyzed. Four different types of question formats for testing knowledge of
phonic sounds were identified (see Table 17). The content (knowledge of the
phonic sounds) was classified into six categories: consonant sounds, vowel
sounds, consonant digraphs, vowel digraphs and diphthongs, controlled vowels,
and variant vowels.

A content bit was selected from each of the six categories. A Format
Familiarity Test (FFT) was constructed so that each of these six content bits
was tested using each of the four formats. In other words, identical content
was tested for each child using four different formats. For instance, the
students' mastery of the consonant digraph "th" was tested using all four
format types. There were a total of 24 separate questions, as shown in
Table 18. Format #4 was scored in two ways, resulting in five scores for each
phonic sound. First, format 4A was scored correct if the student pronounced
the target sound correctly. Second, format 4B was scored correct only if the
student pronounced the entire word correctly (this is the scoring procedure
suggested by the manual).
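The design described above (six content bits crossed with four formats, with format 4 scored two ways) can be laid out as a small sketch. The category labels follow the text and Table 18; the variable and function names are illustrative and do not come from the report.

```python
from itertools import product

# Six phonic content categories and four item formats, as described in the text.
SOUND_CATEGORIES = [
    "consonant sound",
    "short vowel sound",
    "consonant digraph",
    "vowel digraph and diphthong",
    "controlled vowel",
    "variant vowel",
]
FORMATS = [1, 2, 3, 4]

# One question per (format, sound category) pair: 4 x 6 = 24 items.
items = list(product(FORMATS, SOUND_CATEGORIES))


def score_labels(fmt):
    """Format 4 yields two scores: 4A (target sound) and 4B (whole word)."""
    return ["4A", "4B"] if fmt == 4 else [str(fmt)]


# Scores recorded for each phonic sound.
scores_per_sound = [label for fmt in FORMATS for label in score_labels(fmt)]

print(len(items))         # 24
print(scores_per_sound)   # ['1', '2', '3', '4A', '4B']
```

Because every child answers every cell of this matrix, format and phonic sound are both within-subject factors, which is why a repeated measures analysis is used in the results.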

Procedures

The Format Familiarity Test was administered to two groups of students. In
Study I, the Format Test was administered to 37 second grade Title I students
in May, 1980. The students were from eight school districts in Utah. They
were primarily Caucasian and evenly split between rural and urban communities.
In Study II, the Format Test was administered to 31 second grade Mexican-


Table 18

Item Matrix of the Six Phonic Sounds
Tested with Four Format Types

Formats:

1. Stanford Achievement Test, Level II
   "Mark under the word ___."

2. California Achievement Test, Level 12
   "Find the word with the same beginning sound as the word ___."

3. Stanford Achievement Test, Level II
   "Mark the space under the word that has the same sound as the underlined
   sound."

4. Woodcock Reading Mastery Test
   "How does this word sound?"

                                Format 1             Format 2              Format 3                       Format 4

CONSONANT SOUND                 sow snow show        shake state snack     snip: ship snack sip
SHORT VOWEL SOUND               rug rig rag          bad big nut           luck: rut lick wop             rug
CONSONANT DIGRAPH (th)          torn horn thorn      thick tow torn        thin: tin then throw
VOWEL DIGRAPH AND DIPHTHONG     sport spoot spot     loud putt pot         about: four cow above          fout
CONTROLLED VOWELS               bran Lom bam         for from mark         car: corn in ran               pard
VARIANT VOWELS                  laugh lag leaf       told caught tall      caught: laugh bought cow       sauf


American students from a single school in Laredo, Texas in January, 1981. In
both studies, the teachers were asked on a pre-arranged visit to supply
several "average" Title I students. The teachers were told that the students
would be administered a test to determine the effect on students' scores of
various question and answer formats from standardized tests. The students
were administered the Format Familiarity Test in small groups of six, except
for the last section, which was derived from the Woodcock and was administered
individually. During test administration, each format section was preceded by
abridged directions from the test from which it was derived.

Results

Study I

The cell and marginal means of percentage of correct answers are
presented in Table 19. A two-way repeated measures ANOVA indicated a
statistically significant difference in the percent of students scoring
correctly on format types (F = 32.41, p < .001) (see Table 20).

Since the main effect for format was significant, the Newman-Keuls
Multiple Range Test for Differences Between Means was calculated. Group means
are ranked from largest to smallest across the top of the table and from
smallest to largest down the side. Pairwise differences between the means of
all possible pairs of formats are presented in the matrix. The results are
presented in Table 21. The differences between all pairwise comparisons of
the means were statistically significant.

Study II

The cell and marginal means of percentage of correct answers for the
Mexican-American students are presented in Table 22. A two-way repeated
measures ANOVA indicated a statistically significant variation in the percent
of students scoring correctly on format types (F = 31.44, p < .001) (Table 23).



Table 19

Study I

Cell and Marginal Means and Standard
Deviations of Percentage Scores on
the Format Test for Caucasian Students

                           Format 1         Format 2         Format 3         Format 4A        Format 4B*

Total                      91.90 (27.4)     55.41 (49.95)    46.40 (49.8)     64.87 (47.66)    50.90 (50.10)
Consonant Sounds (sn)     100.00 (00.0)     62.16 (49.77)    75.68 (43.50)    72.97 (45.02)    48.65 (50.71)
Short Vowel Sounds (u)     97.30 (16.43)    64.87 (48.40)    70.27 (46.34)    91.89 (27.07)    62.16 (49.17)
Consonant Digraph (th)     91.90 (27.67)    83.78 (37.37)    24.32 (43.50)    86.49 (35.07)    72.97 (43.92)
Vowel Digraph (ou)         91.90 (28.03)    40.54 (49.71)    18.92 (39.71)          (48.71)    51.35 (50.67)
Controlled Vowel (ar)      73.00 (45.02)    43.24 (49.17)    59.46 (49.44)    59.46 (49.17)    54.05 (50.54)
Variant Vowel (au)         97.30 (16.4)     37.84 (49.17)    29.73 (41.69)    16.22 (37.37)    16.22 (37.37)

*4A and 4B are derived from the Woodcock. 4A is scored correct if the target sound is pronounced correctly.
4B is scored correct only if the entire word is pronounced correctly, as suggested by the manual.
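As a consistency check on Table 19, the marginal mean for a format should equal the average of its six cell means, since every student answered every item. The sketch below uses the Format 1 to Format 3 columns as extracted from the table; the variable names are illustrative.

```python
# Cell means (percent correct) for the six phonic sounds, by format (Table 19).
cells = {
    "Format 1": [100.00, 97.30, 91.90, 91.90, 73.00, 97.30],
    "Format 2": [62.16, 64.87, 83.78, 40.54, 43.24, 37.84],
    "Format 3": [75.68, 70.27, 24.32, 18.92, 59.46, 29.73],
}

# Marginal means as printed in the first row of Table 19.
reported = {"Format 1": 91.90, "Format 2": 55.41, "Format 3": 46.40}

# With equal numbers of students per cell, the marginal mean is the
# unweighted average of the cell means.
marginal = {fmt: sum(values) / len(values) for fmt, values in cells.items()}

for fmt in cells:
    print(fmt, round(marginal[fmt], 2), reported[fmt])
```

The recomputed values agree with the printed marginal means to within rounding, which supports the cell values shown for these three columns.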






Table 20

Source of Variance for Repeated Measures
ANOVA with Caucasian Students

Source        Degrees of Freedom    Sum of Squares       F       Significance Level

Phonics               5                  19.17          15.41          .001
Formats               4                  29.12          32.41          .001
Subjects             36                  21.84
P X F                20                  20.48           7.37          .001
P X R               180                  44.79
F X R               144                  32.34
P X F X R           720                 100.06
TOTAL
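The F values in Table 20 follow from the sums of squares and degrees of freedom in the usual repeated measures way: each effect's mean square is divided by the mean square of its interaction with subjects (R). The sketch below is a check under that assumption; it also assumes 144 degrees of freedom for F X R, that is (5 - 1) x (37 - 1), because that cell is not legible in the source. The function name is illustrative.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = (SS_effect / df_effect) / (SS_error / df_error)."""
    return (ss_effect / df_effect) / (ss_error / df_error)


# Each effect is tested against its own interaction with subjects (R).
f_phonics = f_ratio(19.17, 5, 44.79, 180)    # Phonics vs. P X R
f_formats = f_ratio(29.12, 4, 32.34, 144)    # Formats vs. F X R (df 144 assumed)
f_pxf = f_ratio(20.48, 20, 100.06, 720)      # P X F vs. P X F X R

print(round(f_phonics, 2), round(f_formats, 2), round(f_pxf, 2))
```

The recomputed ratios fall within 0.01 of the printed values (15.41, 32.41, and 7.37), the small gap being attributable to rounding in the printed sums of squares.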




Table 21

The Newman-Keuls' Multiple Range Test:
Differences Between Means of Format
Types for Caucasian Students

                      Format 1    Format 4A    Format 2    Format 4B    Format 3
                        .919        .649         .554        .509         .464

Format 3     .464                   .185**       .090**      .045**
Format 4B    .509                   .140**       .045**
Format 2     .554
Format 4A    .649       .270**
Format 1     .919

** = p < .01
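The entries of the Newman-Keuls matrix are simple pairwise differences between the format means, so any cell can be recomputed from the means listed in Table 21. The sketch below does only that arithmetic; it does not compute the Newman-Keuls critical values, and the names are illustrative.

```python
from itertools import combinations

# Format means from Table 21, ordered from largest to smallest.
means = {
    "Format 1": 0.919,
    "Format 4A": 0.649,
    "Format 2": 0.554,
    "Format 4B": 0.509,
    "Format 3": 0.464,
}

# Difference for every pair (larger mean minus smaller mean).
pairwise = {
    (a, b): round(means[a] - means[b], 3)
    for a, b in combinations(means, 2)
}

for pair, diff in pairwise.items():
    print(pair, diff)
```

Five formats give ten pairwise comparisons; the six differences that are legible in the table (.270, .185, .140, .090, and .045 twice) match the values computed here.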





Table 22

Cell and Marginal Means and Standard Deviations of
Percentage Scores for Mexican-American Students

                           All Formats     Format 1       Format 2       Format 3       Format 4A      Format 4B*

Total                                      .57 (49.39)    .38 (48.52)    .41 (49.29)    .75 (43.80)    .58 (49.48)
Consonant Sounds (sn)      .57 (48.49)     .06 (24.97)    .84 (37.39)    .77 (42.50)    .81 (40.16)    .39
Short Vowel Sounds (u)     .66 (46.16)     .94 (24.97)    .42 (50.16)    .71 (46.14)    .71 (45.68)    .55 (50.59)
Consonant Digraph (th)     .55 (49.95)     .83 (32.39)    .52 (50.80)    .23 (42.50)    .74 (44.48)    .45
Vowel Digraph (ou)         .43 (49.48)     .87 (34.08)    .19 (40.16)    .00 (0.00)     .55 (50.59)    .55 (50.59)
Controlled Vowels (ar)     .53 (49.81)     .42 (50.16)    .13 (34.08)    .55 (50.59)    .84 (37.39)    .74 (44.48)
Variant Vowels (au)        .47 (50.14)     .38 (49.51)    .16 (36.89)    .19 (40.16)    .81 (40.16)    .81 (40.16)

*4A and 4B are derived from the Woodcock. 4A is scored correct if the target sound is pronounced correctly.
4B is scored correct only if the entire word is pronounced correctly, as is suggested by the manual.




Table 23

Source of Variance for Repeated Measures ANOVA
for Mexican American Students

Source        Degrees of    Sum of                Significance
              Freedom       Squares       F       Level

Phonics            5           5.16     11.77        .001
Format             4          16.48     31.44        .001
Subjects          30          82.41
P X F             20          46.37     26.84        .001
P X R            150          13.14
F X R            120          15.72
P X F X R        600          51.83
TOTAL            929         231.11
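The F ratios in Table 23 can be checked against the sums of squares with a short calculation. The sketch below is illustrative only. It assumes that each effect is tested against its interaction with subjects (Phonics against P X R, Format against F X R, and P X F against P X F X R), and that the degrees of freedom for Phonics (5) and Format (4) are the values implied by the interaction terms. The names `table_23`, `mean_square`, and `f_ratio` are hypothetical.

```python
# Sums of squares and degrees of freedom as read from Table 23.
# P = phonics content, F = format, R = subjects.
# The df for Phonics (5) and Format (4) are inferred from the
# interaction df (20 = 5 x 4, 150 = 5 x 30, 120 = 4 x 30).
table_23 = {
    "Phonics":   (5.16, 5),
    "Format":    (16.48, 4),
    "Subjects":  (82.41, 30),
    "P X F":     (46.37, 20),
    "P X R":     (13.14, 150),
    "F X R":     (15.72, 120),
    "P X F X R": (51.83, 600),
}


def mean_square(source):
    """Mean square = sum of squares divided by degrees of freedom."""
    sum_of_squares, df = table_23[source]
    return sum_of_squares / df


def f_ratio(effect, error_term):
    """F ratio of an effect against its (assumed) error term."""
    return mean_square(effect) / mean_square(error_term)


# Assumed pairing of each effect with its effect-by-subjects error term.
f_values = {
    "Phonics": f_ratio("Phonics", "P X R"),
    "Format": f_ratio("Format", "F X R"),
    "P X F": f_ratio("P X F", "P X F X R"),
}

total_ss = sum(ss for ss, _ in table_23.values())
total_df = sum(df for _, df in table_23.values())

for effect, value in f_values.items():
    print(f"{effect}: F = {value:.2f}")
print(f"Total SS = {total_ss:.2f}, total df = {total_df}")
```

Under these assumptions the computed ratios agree with the tabled values of 11.77, 31.44, and 26.84 to within rounding, and the component sums of squares add up to the tabled total of 231.11.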






Since the main effect for format was statistically significant, the
Newman-Keuls Multiple Range Test for Differences Between Means was
calculated. The results are presented in Table 24. Significant differences
between means at p < .01 are indicated with asterisks. The differences between
the means of formats 1 and 5 and between the means of formats 3 and 2 are the
only differences that are not statistically significant.
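The matrix of differences in Table 24 follows directly from the five format means. The sketch below is a minimal illustration, assuming the means reported in Table 24. It rebuilds only the pairwise differences. It does not compute the Studentized range critical values that the Newman-Keuls procedure uses to judge significance. The function name `pairwise_differences` is hypothetical.

```python
from itertools import combinations

# Format means for Mexican American students, as reported in Table 24.
format_means = {
    "Format 4": 0.742,
    "Format 1": 0.586,
    "Format 5": 0.581,
    "Format 3": 0.409,
    "Format 2": 0.376,
}


def pairwise_differences(means):
    """Return {(higher, lower): difference} for every pair of means."""
    ordered = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return {
        (high, low): round(mean_high - mean_low, 3)
        for (high, mean_high), (low, mean_low) in combinations(ordered, 2)
    }


differences = pairwise_differences(format_means)

# List the pairs from the smallest difference to the largest.
for (high, low), diff in sorted(differences.items(), key=lambda item: item[1]):
    print(f"{high} - {low} = {diff:.3f}")
```

The two smallest differences produced this way (Format 1 minus Format 5, and Format 3 minus Format 2) are the two pairs the text identifies as not statistically significant.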



Discussion

The results of this study indicate that students score very differently
on phonics items depending on the format used to present the items. This
means that conclusions about how well a student has mastered phonics content
will depend in part upon the format of the particular standardized test which
is used.

Since raw score percentage correct is used as the dependent variable in
these analyses, these differences would be meaningless if they were adjusted
for in each test's norming procedures. In other words, a difference between a
score of 50% correct on the format of Test A and 80% correct on the format of
Test B would be unimportant if the 60th percentile for Test A was 50% correct
and the 60th percentile for Test B was 80% correct. As can be seen in Table 25,
test norms from the tests used in the study do not account for the
differences.

In support of previous research, the students performed best on those
formats which did not require the use of skills and knowledge in addition to
that required to answer the question. This is directly evidenced by the
differences in scores obtained on formats 4A and 4B. Format 4A was scored
correct if the target sound was pronounced correctly, whereas format 4B (the
method recommended by the test publisher) was scored correct only if the






Table 24

The Newman-Keuls Multiple Range
Test of Differences Between Means
for Format Types for Mexican American Students

                     Format 4   Format 1   Format 5   Format 3   Format 2
                       .742       .586       .581       .409       .376

Format 2   .376       .366**     .210**     .205**     .033
Format 3   .409       .333**     .177**     .172**
Format 5   .581       .161**     .005
Format 1   .586       .156**
Format 4   .742

* = p < .05
** = p < .01






Table 25

Comparison of Normative Scaling of
the Phonics Analysis Subtests of
the CAT and SAT at the End of
Second Grade

                 Percentage of raw score correct

Percentile            SAT          CAT

50th                  73%          72%






entire word was pronounced correctly. Format 4B required that the students
know additional content besides that of the target content.

The Mexican-American students scored lowest on formats 2 and 3. These
formats required an extra cognitive step in much the same manner as do recall
items versus recognition items. Formats 1 and 4 require that the student
simply pronounce or identify a word. Formats 2 and 3 require that the student
discriminate between sounds within a word, isolate the sound, and identify
that sound in another word. This is a far more complex task.

The Caucasian students also scored poorly on formats 2 and 3. They
scored far better than the Mexican-American students on format 1 and worse on
formats 4A and 4B. This suggests a disordinal interaction of format types
with ethnicity. This may be evidence that a student's familiarity with the
test's question format influences the student's scores on that test. All of
the Mexican-American students used Spanish as their first language. Much of
their educational experience may have been learning the correct pronunciation
of English from written symbols. This is exactly the skill that format 4
samples and on which Mexican-American students did best. Correspondingly,
much of the Caucasians' educational experience may have been recognizing the
correct written symbols for words they already knew. This is the skill format
1 samples and on which the Caucasian students did best.

Since the main focus of this study was to investigate whether different
format types yielded different results when testing identical content, the
statistically significant main effect for phonics content and its interaction
with format type are not discussed further. The statistically significant
interaction may be due, in part, to a sparse sampling of the domain of phonics.
This hypothesis could be investigated further if future research
uses more questions from each phonics content category.






Although the results are based on a fairly small sample and the Format
Familiarity Test made no effort to comprehensively sample the content domain,
the results do indicate that what a student appears to know is confounded with
the format of the test that is used. One explanation for this is that
students may perform better on those tests which ask questions in the same way
that is familiar to the student during instruction. In other words, using an
unfamiliar format may well decrease the student's score by introducing an
irrelevant variable and make it appear that the student has not mastered the
materials as well as he or she has.

School personnel have traditionally been concerned with selecting
achievement tests that sample the content being taught in the curriculum. Not
as much attention has been paid to selecting a test that asks questions about
the content in the same manner as is done in the classroom. These results
indicate that the format in which questions are asked is an important
consideration for selecting an achievement test.






CHAPTER VI

EFFECTS ON STANDARDIZED ACHIEVEMENT TEST PERFORMANCE OF
TRAINING TEACHERS, TRAINING STUDENTS, AND MOTIVATING STUDENTS



A second component which was added to the work scope of the
"Refinements" contract was a study to investigate the effect on
standardized achievement test performance of:
1) training teachers to follow standardized testing procedures;
2) training students in test taking skills; and
3) motivating students to perform well on the test.
During August and September of 1979, project staff visited a number
of districts in Utah to observe standardized test administration. The
purpose of these visits was to prepare for the project's data collection
efforts the following spring, and to assist Salt Lake District in
implementing Models A and B correctly. During these visits, staff
members observed that many test administrators did not adhere to
standardized test administration procedures and students appeared to be
confused about what was expected of them and/or uninterested in the test.
Conversations with teachers after the testing confirmed that many
teachers were not adequately prepared to administer the tests and did not
see the time spent on testing as worthwhile. As project staff were
preparing the test booklets for automated scoring by the publisher, the
number of stray marks, incorrectly followed directions (by both teachers
and students) and generally sloppy or careless completion of the booklets
made it clear that many people did not view the administration of
standardized tests as an important activity.

These observations reinforced the concept that other variables in
addition to what a student knows may be influential in determining a
student's test score. For Title I students the possibility that
extraneous factors are confounding what a student appears to know is
particularly important for two reasons:

1. Because standardized achievement test scores are assumed to
reflect how much a student knows, the results of these tests are
frequently used to make educational placement and programming
decisions. If the test is not a valid indicator of what a
student knows, placement and programming decisions based on the
test may be incorrect.

2. All of the TIERS models utilize the results of standardized
achievement tests to measure the impact of Title I. In this
context "impact" is defined as what a student learns which he
would not have learned had the Title I program not been offered.
To be a valid measure of "impact", regardless of which model is
used, the tests utilized must measure student's knowledge
without being substantially confounded with other variables.

The importance of these issues to the objectives of the "State
Refinements" project prompted the inclusion of a discussion of this added
study in this Final Report. Although not included in the original work
scope, the activities of this component were partially supported by
project funds. Additional funds came from a small Student Initiated
Research Grant ($6,801) from the Bureau of Education for the Handicapped
(these funds paid for materials development, travel and current expense
not included in the Refinements Project, and data collectors), and from
contributed resources and time from Utah State University. The district
in which this research was conducted was one of the districts included in
the original "Refinements" project and many of the procedures and
instrumentation developed for these activities were also utilized in
accomplishing the objectives of the Refinements contract. As a result of
the symbiotic relationship between the original workscope and these
extended activities, both components benefited, and the government
received "more for their money" than had originally been planned.

The purpose of the study described in this chapter was to determine
the effect of training students in test taking skills, training teachers
in standardized test administration procedures, and reinforcing students
for high performance during testing. The effect of treatment conditions
on four dependent measures was examined using a true experimental design,
and data were analyzed by a three-way (2 X 2 X 2) analysis of variance.
The dependent variables were student reading scores, student on-task
behavior, teacher on-task behavior, and invalid test booklet marks.
Procedures

Subjects

Participants in the study (students and test administrators) were
from the district's eight Title I schools (16 classes) and from four
"similar" schools (eight classes). "Similar" schools were selected by
the Title I director as those most closely approximating the Title I
schools in mean IQ, previous achievement, achievement test scores, and
income level. Analyses demonstrated that there were no statistically
significant differences between the Title I schools and the similar
schools on any of these variables (reported earlier in Chapter III,
Table 2).

Permission to conduct the study in the Salt Lake City District was
obtained from Dr. Darlene Ball, then Director of Title I for Salt Lake
City District, and Dr. Stanley Morgan, Research Coordinator for the
district. The study was endorsed by Maurine McDonald, District
Coordinator for Title I, and Dr. Cy Freston, Utah State Office of
Education Specialist in Learning Disabilities. The research was approved
by the Institutional Review Board at Utah State University.

Notice was sent to each parent (see Appendix 9 for a copy of the
letter) describing the reinforcement procedures, the research rationale,
and procedures for withdrawing their child from the study if they so
desired. Personal individual contact was made with each principal in the
12 schools to explain the research procedures and secure the authority
for treatment and data collection (see Appendix 9 for letters of support,
approval and notification).

Students. Second grade classrooms were chosen to participate in
the study because it is frequently at this level that group achievement
tests are first encountered. A bad testing experience could negatively
influence the students' attitudes toward future tests. Although all
participating students attended district-selected Title I or "similar"
schools, only 30% were actually Title I "target" students.

The selection criteria for classifying students as target were
different, though related, for the Title I schools and the "similar"
schools. In the designated Title I schools, the identification of a
student as a Title I target student was based on students' spring
performance on five key indicators:

1. Reading subtest (SAT)

2. Math subtest (SAT)

3. Language subtest (SAT)

4. Teacher checklist of behavior (locally developed)

5. Teacher evaluation of oral language skills (locally developed)

The students' scores on each indicator were assigned weighted point
values. Students scoring below a locally determined criterion were
identified as target students. In addition to meeting this selection
criterion, students had to score below grade 1.8 on the reading or math
subtests (SAT) that were given during spring of the previous year to be
eligible for reading or math Title I programs.

Since the "similar" schools were not designated as Title I and there
were no official target students in the classrooms, it was necessary to
develop criteria for classifying a sample of students in the "similar"
schools as unofficial "targets". The method used for identification was
to select students with the lowest 19% of the scores on the Spring, 1980,
SAT in the similar schools. The 19% cutoff was determined as the median
percent of Title I students in the designated Title I classrooms.
Hereafter, the term "target" will refer to the combined group of students
from Title I and similar schools.

Of the 597 participating students, 323 (54%) were male; 180 (30%) were
"target" students and 46 (7.7%) were mainstreamed mildly handicapped
special education students. All special education students were also
identified as Title I students and were included in the analysis in the
"target" group. The average number of students per classroom was 24.9,
with a range of 18 to 33.



Test administrators. The standardized achievement test used as
one of the dependent variables in the study was administered by 24
teachers: 12 were Title I teacher leaders and 12 were regular classroom
teachers. The teacher leaders were certified teachers who performed
resource functions in the elementary schools (directing remedial
instruction by the pull-out teachers and classroom teachers, testing
students, programming instruction, and providing inservice training to
district teachers).

During a group meeting, the study was explained to the teacher
leaders, who all agreed to participate. Teacher leaders were randomly
assigned during the administration of the test for the study to one of
the 12 regular second grade classrooms which had been selected earlier to
have the test administered by a trained teacher.

Regular teachers were introduced to the study via a letter from the
district Title I director and subsequently sent instructions regarding
their particular duties during the testing. Untrained teachers were not
told about the research variables but were informed that they would have
observers in their classrooms from time to time and that their students
would be reinforced or trained. At the conclusion of the testing, the
details of the study were provided to each of the teachers.

Test monitors. The Title I director selected 24 Title I pull-out
teachers to serve as test monitors in all classrooms used in the study.
The pull-out teachers were fully certified elementary teachers who worked
a half-day under teacher leaders as instructors to remediate Title I
students in reading and math. For the study, each monitor was randomly
assigned to one classroom to assist the teacher (trained or untrained)
responsible for the testing.






Treatment

The effect on student test scores and student and teacher behavior
during testing of three separate factors was examined during the study:
(a) reinforcing students for scoring higher on the spring test than would
have been predicted from their fall test, (b) training students in
techniques to increase their test taking skills, and (c) training
teachers to administer the test using good testing practices and
standardized testing procedures.

The design of the study was completely crossed; each treatment and
control interfaced with other treatments and controls, creating a
2 X 2 X 2 block, or eight cells. Each of 24 classes of students was
randomly assigned to one of the eight cells of the experimental design
shown in Table 26. Random assignment was made by drawing slips of paper
containing concealed classroom identification numbers. The mean number
of students per cell was 74.62 with a range from 67 to 85.
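The assignment of 24 classrooms to the eight cells of a 2 X 2 X 2 design can be sketched in a few lines. This is an illustration only: the study drew slips of paper, whereas the sketch substitutes a seeded pseudorandom shuffle. It also assumes an equal number of classrooms per cell (three), which matches the layout of Table 26. The function name `assign_classrooms` and the seed value are hypothetical.

```python
import random
from itertools import product

# The three crossed factors, each with a treatment and a control level.
TEACHER = ("trained teacher", "untrained teacher")
STUDENTS = ("trained students", "untrained students")
REINFORCEMENT = ("reinforced", "unreinforced")


def assign_classrooms(classroom_ids, seed=None):
    """Randomly assign classrooms to the eight cells, equal numbers per cell."""
    cells = list(product(TEACHER, STUDENTS, REINFORCEMENT))
    shuffled = list(classroom_ids)
    if len(shuffled) % len(cells) != 0:
        raise ValueError("classrooms cannot be divided evenly among the cells")
    # A shuffle stands in for drawing concealed slips of paper.
    random.Random(seed).shuffle(shuffled)
    per_cell = len(shuffled) // len(cells)
    return {
        cell: sorted(shuffled[i * per_cell:(i + 1) * per_cell])
        for i, cell in enumerate(cells)
    }


# Classroom identification codes 1 through 24, as in Table 26.
assignment = assign_classrooms(range(1, 25), seed=1980)
for cell, classrooms in assignment.items():
    print(", ".join(cell), "->", classrooms)
```

Because whole classrooms rather than individual students are assigned, the number of students per cell varies with class size, which is why the cells in the study range from 67 to 85 students.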

Reinforcement for students. Past procedures for motivating
students to do well have included the use of many types of verbal and
tangible rewards. Nickels were chosen as the reinforcement in this study
for several reasons:

1. Students in Title I schools came from low income families in the
community, creating the strong possibility that money was highly valued.

2. Using money avoided the dietary problems of food (i.e., candy),
special preferences of prizes, or the confounding personality variables
of praise.






Table 26

Assignment of Classrooms to
Treatment Groups

                                 Trained Teacher        Untrained Teacher
                                 (n = 307)              (n = 290)

Trained Students (n = 291)

   Reinforced                    1 Bennion (25)         7 Lowell
                                 12 Washington (23)     15 Whittier
                                 21 Backman (27)        24 Edison
                                 n = 75                 n = 70

   Unreinforced                  5 Lincoln (29)         4 Jackson (27)
                                 9 Parkview (18)        20 Hawthorne (22)
                                 11 Parkview (24)       23 Edison (26)
                                 n = 71                 n = 75

Untrained Students (n = 306)

   Reinforced                    8 Lowell (27)          2 Bennion (23)
                                 19 Hawthorne (31)      3 Franklin (21)
                                 22 Backman (27)        13 Washington (23)
                                 n = 85                 n = 67

   Unreinforced                  14 Whittier (23)       6 Lincoln (33)
                                 17 Emerson (26)        10 Parkview (22)
                                 18 Emerson (27)        16 Whittier (23)
                                 n = 76                 n = 78

Reinforced classrooms: n = 297. Unreinforced classrooms: n = 300.

The number preceding each classroom is a unique identification code.
The number following each classroom indicates the number of students in
each classroom.






3. Previous research has demonstrated that low IQ students perform
better on individualized aptitude tests with monetary rewards than
without (Rasmussen, 1973).

4. Money is easy to dispense and control.

5. Increments of nickels are equal and noticeable (i.e., the better
the student does on the test, the bigger the pile of money received).

During Spring, 1980 district achievement testing, students in those
classes which had been randomly assigned to the reinforcement condition
were reinforced for doing better on group tests than was expected based
on their Fall, 1979 test score. The amount paid to each student was
based on the number of raw score points above an individually established
base (expected) score. The procedures for determining the base score for
each child are illustrated in Table 27 using hypothetical scores for
three students.



Table 27

Examples of Procedures Used to Determine the
Amount of Reinforcement Given to Students
Based on Fall Test Scores

          Fall Testing Norms      Spring Testing Norms                 Payment    Actual Spring
          Raw Score  Percentile   Percentile  Raw Score   Adjustment   Criteria   Score          Reinforcement

Johnny       23         30           30          20           -5          15         25             $.50
Mabel        33         50           50          33           -5          28         27
Helen        40         70           70          40           -5          35         41             $.30






For example, if a student had a raw score on the Fall test of 23 (or
33 or 40), this would be at the 30th (or 50th or 70th) percentile using
Fall norms for that level of the test. If the student learned at the
normal rate of students in the norming sample, the same percentile rank
using Spring norms would be maintained on the Spring test. The 30th
percentile (or 50th or 70th) using Spring norms for the level of the test
used in the Spring is associated with a raw score of 20 (or 33 or 40).
Each equipercentile score thus derived was adjusted by subtracting 5
points so that more students would be able to earn money.

In the example in Table 27, the payment criterion for Johnny
(20 - 5 = 15) becomes the base score above which Johnny must achieve on
the Spring test to be rewarded. If Johnny's actual Spring score was 25,
then he would earn 10 nickels (25 - 15 = 10) or $.50. Using these
procedures, payment criteria were individually established for each
student. For students who didn't take the pretest (absent during Fall
testing or transferred after Fall testing), teachers estimated a
percentile rank for the pretest based on classroom performance.
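The payment rule can be expressed as a short calculation. The sketch below is a minimal illustration that uses only the three hypothetical students from Table 27. The two lookup tables stand in for the publisher's Fall and Spring norms and contain only the values quoted in the example. They are not the actual SAT norms. All function and variable names are hypothetical.

```python
ADJUSTMENT = 5          # points subtracted so more students could earn money
CENTS_PER_NICKEL = 5

# Stand-ins for the norms tables, limited to the values used in Table 27.
FALL_RAW_TO_PERCENTILE = {23: 30, 33: 50, 40: 70}
SPRING_PERCENTILE_TO_RAW = {30: 20, 50: 33, 70: 40}


def payment_criterion(fall_raw):
    """Base (expected) Spring score: equipercentile score minus 5 points."""
    percentile = FALL_RAW_TO_PERCENTILE[fall_raw]
    expected_spring_raw = SPRING_PERCENTILE_TO_RAW[percentile]
    return expected_spring_raw - ADJUSTMENT


def nickels_earned(fall_raw, spring_raw):
    """One nickel per raw score point above the payment criterion."""
    return max(0, spring_raw - payment_criterion(fall_raw))


def payment_cents(fall_raw, spring_raw):
    """Amount paid, in cents."""
    return CENTS_PER_NICKEL * nickels_earned(fall_raw, spring_raw)


# The three hypothetical students: (Fall raw score, actual Spring raw score).
students = {"Johnny": (23, 25), "Mabel": (33, 27), "Helen": (40, 41)}
for name, (fall_raw, spring_raw) in students.items():
    cents = payment_cents(fall_raw, spring_raw)
    print(f"{name}: criterion {payment_criterion(fall_raw)}, paid ${cents / 100:.2f}")
```

Working in whole cents avoids rounding problems with fractional dollars. A student whose Spring score falls at or below the criterion, as Mabel's does, earns nothing under this rule.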
Four subtests were selected as the units for reinforcement:
Reading A, Reading B, Word Study, and Mathematics (Math Concepts plus
Math Computations). Although students in the reinforced group were paid
for performance on the Reading A, Reading B, and Word Study subtests,
only data on Reading B and Word Study scores were used in the analyses.
To convince students that they would really be given money if they did
better than the payment criterion on the test, Reading A was administered
to the students and reinforced on the day immediately preceding Reading B
and Word Study. After observing the actual payment of money for their
performance on Reading A, the students were more likely to believe that
they would get paid for doing better than the payment criterion on
Reading B and Word Study.

Students in the unreinforced classrooms were paid for performance on
Math subtests to prevent a negative reaction from the nonpayment group
for not having a chance to earn nickels. All unreinforced classrooms
received the reading tests before the math tests so that data used in
making the reinforced vs. unreinforced comparison could be collected
prior to delivering rewards for math performance.

The procedures for reinforcing students are outlined below.

1. Notification. Just before giving the directions to the subtest,
the test administrator read this statement: "Today you will earn nickels
for doing well on the next test. The higher you score, the more nickels
you will get. Try very hard to do your best, and I (your teacher) will
give you your money to take home this afternoon."

2. Scoring. On the day of reinforcement (same day as subtest),
trained scorers traveled to each school with a test key, scored the
subtest, computed the amount earned, and prepared an envelope of nickels
for each rewarded student.

3. Payment. The classroom teacher presented the rewarded students
with envelopes containing their earnings just before school ended for the
day.






Training for test administrators and monitors. Training in
appropriate test administration was provided by the investigator to the
12 Title I teacher leaders (test administrators) and to 12 randomly
selected Title I pull-out teachers (test monitors) prior to the group
achievement testing in the Spring. The training program for each group
of teachers is outlined below.



Teacher leaders were selected to receive training instead of regular
classroom teachers for the following reasons:

a) they fully supported the study and had expressed a desire to
bring a test preparation program into the district;

b) because they had fewer habitual reactions to particular students
than would the regular classroom teacher, it was thought they
would be more consistent and dependable in carrying through with
appropriate behaviors learned during the training; and

c) teacher leaders were chosen over nonteaching personnel because
they had some familiarity with the students and with proper group
testing procedures.

To prepare for the test administration training, a videotape was
constructed that depicted actual classroom scenarios of correct and
incorrect methods of giving tests. Second grade students from
Wellsville, Utah, and their teacher agreed to act in contrived situations
to portray the results of appropriate and inappropriate test
administration. Permission for filming was obtained from the teacher,
principal, and parents, and all had an opportunity to view the final
product (see Appendix 10 for approval forms).

Displayed on the film are both the right and wrong methods for
giving tests. Previous observations and research were used to compile
the script and included:

1. Preparing students for the test.

2. Arranging the testing room.

3. Entering the testing room.

4. Distributing test materials.

5. Giving directions.

6. Monitoring the students.

7. Using an aide.

8. Finishing the test.

9. Providing assistance to students.

10. Dealing with unexpected events.

11. Pacing.

12. Obtaining group response.

Training took place in the Salt Lake City District Offices two weeks
before the district testing. Two sessions, on the afternoons of April 15
and 17, were held from 1:30 to 3:30. Training content covered general
test administration, on-task teacher behavior, the Stanford Achievement
Test (SAT), and the use of test monitors. Procedures for obtaining group
response and for explaining test format were included in the training.
Special attention was given to unexpected testing problems and teachers
were instructed on appropriate responses. For example, if a child needs
to go to the washroom during a timed test, the administrator should record
the time absent and extend the test by that amount for the individual
student.

The Quality of Test Administration Checklist (described in detail in
Chapter IV) was explained prior to seeing the videotape and was used as
an observation guide during the film. Differences in student behavior
under incorrect and correct test administration were then discussed.
Directions for the SAT Practice Test, Level II, were discussed and each
trained teacher was asked to administer the practice test to their
assigned classroom. Copies of the practice test were provided to the
teachers, and they were instructed to use the practice test as a teaching
device and not as a test.



The Title I director selected 24 Title I pull-out teachers for use
as test monitors. They were randomly assigned to classrooms and all
those who were assigned to classes which had trained test administrators
were given training. On April 7, 1980, letters were sent to the selected
pull-out teachers informing them of their participation in the upcoming
Spring testing and indicating the training schedule (for the 12 who were
to be trained).

Training provided to pull-out teachers in test monitoring resembled
that given to the teacher leaders but was condensed to 2 1/2 hours on
Wednesday, April 16, 1980. Specific classroom testing problems and
methods for alleviating them were discussed. For example, a student who
is constantly off task could be given physical prompts (i.e., a hand on
the shoulder, assistance in moving finger from item to item during timed
tests). Like the test administrators, monitors were encouraged to
actually take the test as if they were students prior to the time the
test was administered to the students.

Training for students. Twelve classrooms were randomly selected
to participate in student training in test-taking skills consisting of
coaching in answering specific types of items, training in testwiseness
(i.e., how to guess or deduce answers), and answering sample items on the
SAT Practice Test, Level I. Students were trained by the investigator in
their own classrooms for one hour during the morning of a regular school
day 1 to 2 weeks before the actual testing. The classroom teacher was
not present during the training and each session followed the same
format. The training schedule, a complete copy of training procedures,
and the Practice Test are available on request.




The activities and topics covered during the student training
included:

1. Purpose of the test.

2. Group response.

3. Positive atmosphere.

4. Rules for correct testing behavior.

5. Machine scorable answer forms.

6. Unusual directions.

7. Practice test.

8. Multiple choice items.

9. Rest period.

At the end of 50 minutes, the lowest five performers identified
during the training were given 10 additional minutes of individual help,
while the others completed some puzzles.

Dependent Measures

Four dependent variables were measured to assess the effectiveness
of the various intervention conditions: reading test scores, student
on-task behavior, teacher on-task behavior, and test booklet marks.
Instruments and procedures for collecting data and programs for training
observers to record the data were developed during the study. The
following sections describe the dependent variables and the instruments
used to collect data.

Test scores. The Stanford Achievement Test (SAT), Primary Level
II, Form A (Madden, et al., 1972), was used for the Spring test in grade
two to measure the impact of the interventions on students' reading scores.
Level II is both a norm-referenced and objective-referenced test designed
for group administration to assess skill development in vocabulary,
reading, mathematics, spelling, word study, and listening comprehension.
The manual for the SAT contains the general directions for administering
the test as well as the specified verbal instruction to be read to the
students for each subtest. Reliability data for the SAT consist of
split-half estimates and KR-20 coefficients. Reliabilities for all
subtests at all levels on all forms range from .65 to .97 with the
majority between .85 and .95. Evidence for the validity of the tests
consists of two types of information: (a) an increasing difficulty of
items with higher grade levels, and (b) a moderate to high relationship
with previous SATs and with the current and previous Metropolitan
Achievement Tests.

Based on test scores from September, 1979, Level II was selected to
use in testing second grade during Spring, 1980, to avoid floor and/or
ceiling effects. The entire test (excluding optional subtests) was
administered in four days to students in each treatment group according
to directions specified in the manual. Make-up tests were given to
absent students from May 2 through May 5. During the scheduled testing,
subtests were given each morning, the first session administered after
morning exercise followed by a recess and then the second session.

Only one day of the four-day test was utilized for data collection.
A combined score of the Reading B (RB) and Word Study (WS) was selected
as the most suitable dependent measure because of the variety in item
format, item content (comprehension and phonics), and subtest sequence in
the testing schedule. Both timed tests, where students move at their own
pace, and teacher-directed tests, where the class moves as a unit, were
part of RB and WS. Reliability coefficients established during the
norming were .96 for RB and .94 for WS.

The order of administering RB and WS was varied among the classes,
but both subtests were always given on the same day. Since the Vocabulary
and Reading A subtests precede RB and WS, no data were collected on the
first day of testing, and all students had some testing experience before
taking RB and WS.

Test booklets were prepared for machine scoring by erasing all
irrelevant marks and darkening light circles. Tests were scored by the
Utah State Office of Education. A combined raw score, RBWS (113 total
raw score points), was used as the test score dependent measure.



Student on-task behavior. A review of the literature and previous
observations contributed to a list of appropriate behaviors most
conducive to producing high levels of attention to academic tasks. The
definition of student on-task behavior used in this study was described
earlier in Chapter IV, and definitions, examples, and nonexamples of
acceptable activity under both teacher-directed and student-directed
(timed tests) test taking are provided in Appendix 5. To illustrate the
application of the definition, suppose a girl were twisting a shirt
button with her fingers. This behavior (see "body movement, playing with
clothes") is on-task if she is not looking at the button (see "looking,
at test paper") but off-task if she is looking (see "looking, at
clothes").

Student on-task behavior and teacher on-task behavior (see below)
were both recorded on an interval form developed, field tested, and
revised for the study as explained in Chapter IV (see Appendix 5).






Prior to actual data collection, an interrater reliability
coefficient was obtained to establish instrument reliability over six
trials completed after the last revision. Reliability was computed for
each subject separately by the equation:

$$\text{Interrater Reliability} = \frac{\text{Number of agreements}}{\text{Number of agreements} + \text{Number of disagreements}} \tag{1}$$

where: Number of agreements = intervals that two observers record the
same mark, and Number of disagreements = intervals that two observers
record different marks.
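
As a minimal sketch of Equation 1, the agreement ratio can be computed from two observers' interval marks. The function name and the sample marks are hypothetical and are not taken from the study's data.

```python
def interrater_reliability(marks_a, marks_b):
    """Agreements divided by agreements plus disagreements (Equation 1)."""
    if len(marks_a) != len(marks_b):
        raise ValueError("both observers must record the same number of intervals")
    if not marks_a:
        raise ValueError("at least one interval is required")
    agreements = sum(1 for a, b in zip(marks_a, marks_b) if a == b)
    disagreements = len(marks_a) - agreements
    return agreements / (agreements + disagreements)


# Hypothetical marks for one subject: "1" = on-task, "-" = off-task.
observer_1 = ["1", "1", "-", "1", "1", "-", "1", "1"]
observer_2 = ["1", "1", "-", "1", "-", "-", "1", "1"]

# The observers agree on 7 of the 8 intervals.
print(interrater_reliability(observer_1, observer_2))
```

Because the coefficient is computed per subject, a per-trial value such as those in Table 28 would be an average of these ratios over the subjects observed.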

For the field test, three graduate students were paired for six
different trial observations and collected data on five students and one
teacher during each observation. Table 28 contains the reliability
coefficients for students and teachers for each field test trial. Mean
coefficients of .878 for student on-task behavior and .854 for teachers
were obtained by averaging coefficients across trials.



TABLE 28

Interrater Reliability Coefficients for
Trial Observations of On-task Behavior

Subjects                          Trials
Observed            1      2      3      4      5      6

Students (mean)   .865   .875   .712   .846   .985   .984
Teacher           .625   .844   .911   .867   .941   .933
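
The mean coefficients reported in the text can be checked by averaging the Table 28 entries across the six trials. The sketch assumes a simple unweighted average, which agrees with the reported .878 and .854 to within rounding.

```python
from statistics import mean

# Interrater reliability coefficients from Table 28, trials 1 through 6.
students = [0.865, 0.875, 0.712, 0.846, 0.985, 0.984]
teacher = [0.625, 0.844, 0.911, 0.867, 0.941, 0.933]

student_mean = mean(students)
teacher_mean = mean(teacher)

print(f"students: {student_mean:.4f}")
print(f"teacher:  {teacher_mean:.4f}")
```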






During the actual research study, recordings of student on-task
behavior were made by trained observers during the Reading A (RA), RB,
and WS subtests. Teachers were asked to identify the five Title I or
lowest achieving students in their classroom. Observations were made on
the same students for both subtests. On the observation form, observers
identified the students by a physical characteristic and used a tape
recorded beeper to move from block to block to record "on task" or "off
task". Recording began when the test administrator started reading the
directions and ended when the subtest was completed. Five second
intervals consisted of 3 seconds to observe and 2 seconds to record.
Observers watched each child for 4 consecutive intervals or 20 seconds (4
intervals x 5 seconds = 20 seconds) before moving to the next student.



Data were recorded on five students and one teacher (six subjects)
for 2 minutes (6 x 20 seconds) before repeating the cycle. The average
observation time for the RA, RB, and WS subtests was 29.7 minutes. Since
testing time varied across classrooms, the numerical unit chosen for
analysis was the mean percent of on-task behavior per RB and WS
observation. Percentages for each student were computed by the
equation



$$\text{Percent of On-task Behavior} = \frac{\text{Intervals On-task}}{\text{Total Intervals Recorded}} \tag{2}$$



The mean percent was the average percent across the five students
observed and the two observers if the observation was paired.
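
The observation cycle and Equation 2 can be sketched as follows. The mark lists are hypothetical, and expressing the ratio on a 0 to 100 scale is an assumption based on the percentages reported later in this chapter.

```python
OBSERVE_SECONDS = 3          # time spent watching in each interval
RECORD_SECONDS = 2           # time spent marking the form in each interval
INTERVALS_PER_SUBJECT = 4    # consecutive intervals on one subject
SUBJECTS_PER_CYCLE = 6       # five students and one teacher

INTERVAL_SECONDS = OBSERVE_SECONDS + RECORD_SECONDS
SECONDS_PER_SUBJECT = INTERVALS_PER_SUBJECT * INTERVAL_SECONDS
CYCLE_SECONDS = SUBJECTS_PER_CYCLE * SECONDS_PER_SUBJECT


def percent_on_task(marks):
    """Equation 2 on a 0-100 scale: on-task intervals over intervals recorded."""
    if not marks:
        raise ValueError("no intervals recorded")
    on_task = sum(1 for mark in marks if mark == "1")
    return 100.0 * on_task / len(marks)


def mean_percent_on_task(marks_by_student):
    """Average of the per-student percentages for one observation."""
    percents = [percent_on_task(marks) for marks in marks_by_student]
    return sum(percents) / len(percents)


# Hypothetical marks for five students: "1" = on-task, "-" = off-task.
students = [
    ["1", "1", "1", "-"],
    ["1", "1", "1", "1"],
    ["1", "-", "-", "1"],
    ["1", "1", "-", "1"],
    ["1", "1", "1", "1"],
]

print(CYCLE_SECONDS)                   # seconds in one full cycle
print(mean_percent_on_task(students))  # mean percent for this observation
```

For a paired observation, the same average would be taken over both observers' forms.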






Teacher on-task behavior. Standardized testing procedures listed
in the SAT teacher's manual and preliminary observations formed the basis
for defining teacher on-task behavior (explained in Appendix 5). Actions
consistent with attending to student behavior at all times (while
directing the test administration under standardized conditions) are
included as examples of on-task behaviors. For example, a teacher is on
task when reading directions but off task when talking to the entire
class during a timed test (TT).

The interval recording form described above for recording student
on-task behavior was also used by trained observers to collect data on
the teacher. The last set of four 5-second intervals on the recording
form was employed to watch the teacher (20 seconds every 2 minutes). The
numerical unit used in the analysis was the percent of on-task behavior
(see Equation 2).



Test booklets. Marks that would invalidate answers when scored by
machine were identified as an indication of inappropriate student testing
behavior. Invalid marks occurred when students drew pictures, skipped
items, filled in more than one answer, tore booklets, erased too hard,
used crayons or ink, or wrote too lightly. Examples and procedures for
data collection are described more fully below. The numerical unit used
in the analyses was the percent of items or test booklets with
errors.



Observing and Reinforcing 

Personnel hired to observe and deliver reinforcement were mothers of
Title I children contacted through the Title I Parent Advisory Council.
During an orientation meeting held on April 21, 1980, observation
procedures were explained and 17 people contracted to work. The pay rate
of $4.00 per hour included travel from home and data collection. Three
types of work were offered to the observers: collecting data on testing
behavior, scoring and reinforcing reading and math subtests, and
collecting data on test booklets. The procedures for training observers
and implementing each activity are described below.



Collecting data on test behavior. Seventeen observers attended
all the training and collected data on student and teacher testing
behaviors. Two major training strategies were involved: (a) training
prior to going into the classroom during which observers familiarized
themselves with the system and practiced using the system by scoring a
videotape of unrehearsed testing scenes, and (b) practice in the
classrooms. Two training sessions were held on April 21 and 25 to
explain the subtests to be given during the observation time, the student
and teacher on-task behavior definitions, the recording form, the
observation procedures, and the observation schedule.

Observers were trained to collect data by intervals, moving from cell
to cell on the form at a signal from a tape recording that indicated when
to observe (3 seconds) and when to record (2 seconds). Portable tape
players were equipped with earphones for two people to use
simultaneously, facilitating interrater reliability calculation.

Recording started when the teacher began the directions and
observers marked each cell for on-task (1) or off-task (-) behavior. Each
of the five students and the one teacher was observed for 20 consecutive
seconds, or four cells, during each 2-minute block of time. Observers
computed percent on-task by dividing the number of "1" marks by the total
number of intervals. All computations were checked by a second person
and errors adjusted.

Following each videotape practice, data were checked against the
standard (correct recording), discrepancies discussed, and forms
collected for computing reliability. Data collection from videotaped
scenes was practiced on both training days and during a short meeting
after the classroom practice. Reliability was computed from data
comparing each observer with the standard using Equation 1. The obtained
coefficients for each observer ranged from .86 to .93 with an average of
.90.



Observers were assigned in pairs to classrooms to practice data
collection on April 28. During the first subtest (Vocabulary), observers
watched the testing to get a "feel" for the classroom situation and
recorded no data. Observers did record behavior on the interval form
during the second subtest, Reading A, as practice and to compute
reliability. Data collected during the classroom practice yielded an
interrater reliability of .88 using paired observations with Equation 1.
In addition to using Equation 1, pairwise correlations of the paired
observations for the five student on-task percentages were computed
across observers. Mean correlations were RA = .84 (SD = .20), RB = .89
(SD = .12), and WS = .81 (SD = .?5).
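
A pairwise correlation of this kind can be sketched as below. The report does not name the coefficient, so the use of the Pearson product-moment correlation is an assumption, and the two lists of on-task percentages are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical on-task percentages for the same five students,
# as recorded by two paired observers.
observer_a = [80.0, 65.0, 90.0, 70.0, 85.0]
observer_b = [78.0, 60.0, 92.0, 75.0, 83.0]

r = pearson_r(observer_a, observer_b)
print(f"{r:.2f}")
```

Unlike Equation 1, this coefficient reflects whether the two observers rank the students similarly rather than whether they agree interval by interval.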

Actual observations began on Tuesday, April 29, the second test day,
and continued through Thursday, May 1. Observers were randomly assigned
to classrooms administering RA, RB, or WS and collected data alone (31
observations) or in pairs (38 observations), depending on the time and
day. One observer functioned as a substitute to replace absentees. Data
from all the paired observations were used to compute interrater
reliability using Equation 1. Reliability ranged from .74 to .97, with a
mean of .86 for WS, .91 for RB, and .88 for RB and WS combined.



Test scoring and reinforcing. Twelve observers were trained for
scoring subtests and computing money earned as reinforcement for
performance above payment criterion. Students in the reinforced group
were paid money for scores on Reading A, Reading B, and Word Study, and
students in the unreinforced group (control) were paid for Math scores.
Scorers were trained at the Salt Lake City District Office from 2:00 to
4:00 p.m. on Monday, April 28, 1980. Training consisted of practice in
scoring sample subtests and computing money earned based on the
results. Scores for correct and incorrect answers were recorded on a
sheet separate from the booklet and totaled. This total (the actual
score) was placed on a Reinforcement Record that contained precomputed
individual cutoff scores above which the students had to score to earn
money. For students whose names were not on the Reinforcement Record,
the raw score corresponding to the 50th percentile was inserted as the
cutoff score. The difference between the actual score and the cutoff
score for each child was the basis for payment. All scorers obtained
100% correct on each sample test score and computation.
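
The payment rule described above can be sketched as follows. The report does not state how many nickels were paid per raw score point, so `nickels_per_point` is a hypothetical parameter, as are the function names and the sample record; only the cutoff logic comes from the text.

```python
def cutoff_for(student, reinforcement_record, percentile_50_raw_score):
    """Use the precomputed cutoff, or the 50th-percentile raw score if the
    student's name is not on the Reinforcement Record."""
    return reinforcement_record.get(student, percentile_50_raw_score)


def nickels_earned(actual_score, cutoff_score, nickels_per_point=1):
    """Payment is based on the points scored above the cutoff.

    nickels_per_point is an assumed rate; the report gives no figure.
    """
    points_above = max(0, actual_score - cutoff_score)
    return points_above * nickels_per_point


# Hypothetical Reinforcement Record and scores.
record = {"student_a": 35, "student_b": 42}
median_raw_score = 38  # assumed raw score at the 50th percentile

for name, score in [("student_a", 40), ("student_b", 41), ("student_c", 45)]:
    cutoff = cutoff_for(name, record, median_raw_score)
    print(name, cutoff, nickels_earned(score, cutoff))
```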

Scorers were randomly assigned to classrooms scheduled for
reinforcement and reported to the school during lunch the same day that
the appropriate subtests were administered with rolls of nickels,
envelopes, an answer key, and a Reinforcement Record. Test booklets were
scored at the school, reinforcement computed, and the appropriate number
of nickels was placed in an envelope for each rewarded student and left
with the teacher. Tests taken from April 29 through May 1 were scored by
the scorers. Make-ups were handled by the teachers, and money was
delivered to the schools based on their reported needs.

The mean amount of money earned per student under reinforcement
conditions (RBWS) was 91¢ for target students and 45¢ for non-target
students. Under unreinforced conditions, students were paid a mean of
$1.13 per target student and 80¢ per non-target student for Math scores.
Of the 597 students, 519 (87%) received money during the testing
week.

Test booklet data collection. Seven observers collected data on
test booklet marks and prepared them for machine scoring. Interrater
reliability data were collected at random intervals when pairs of
observers were periodically assigned to do the same booklets and data
were compared. Correlations were computed by observer pairs on the numbers
of errors recorded per student and were used as reliability coefficients.
An average interrater reliability coefficient of .99 was obtained from a
range of coefficients of .93 to 1.00.

During the training, observers practiced scoring sample tests
containing all types of violations. A list of errors, training samples,
and the data collection form are provided in Appendix M. One recording
form was used for each student, and data from each page were recorded on
separate lines. Information collected on booklet covers was the number
of covers that had blank circles or had wrong circles darkened. Other
data were number of booklets that had been erased (prepared for machine
scoring) by the teacher, number of items that required erasing or
darkening, and number of items with no answer, more than one circle, or
the wrong answer format. As part of their duties, observers were
required to prepare the booklets for machine scoring after recording
data. To prepare the pages, extraneous marks were erased, light circles
darkened, cover information corrected, and answers recorded properly.

Results and Discussion




To assess differences attributable to the various factors included
in the experimental design, a Univariate Analysis of Variance (ANOVA) was
performed for each of the dependent variables. As described in the
Procedure section, the independent variables consisted of training
students (TS), training teachers (TT) and reinforcing students (RE).
Dependent variables were reading test scores, student on-task behavior
during testing, teacher on-task behavior during testing, and students'
invalid marks on test booklets.

In the analyses, the mean classroom score on each variable was used
as the unit of analysis because entire classes were randomly assigned to
the experimental conditions and training was applied to the class as a
whole. The following sections describe the results from the analysis for
each dependent variable.



To protect against inflating the Type I error rate,
Multivariate Analysis of Variance (MANOVA) is sometimes suggested when
multiple dependent measures are being examined. However, since only four
dependent variables were examined in the study and since univariate
analyses are typically the second step in a MANOVA, it was concluded that
a MANOVA would make the analysis unnecessarily complex without
contributing any significant advantages.




Test Scores 

A three-way ANOVA was used to determine if students' composite
reading subtest scores (RBWS) were statistically significantly different
under various treatment conditions. Additionally, the standardized mean
differences between treatment and nontreatment were examined to determine
the educational significance of the findings.

All students. Results for the combined RBWS score for all students
are presented in Table 29. Statistically significant main effects were
found on TT, F (1,16) = 3.48, p < .10, and RE, F (1,16) = 16.77, p <
.001. Means for RBWS scores (presented in Table 30) indicate that
students receiving reinforcement (X̄ = 87.53) obtained higher RBWS scores
than students who were not reinforced (X̄ = 77.54) and students with
trained test administrators had higher RBWS scores (X̄ = 84.82) than
students with untrained administrators (X̄ = 80.26).

Even more important than the statistical significance of these
differences, however, is the educational significance. A number of
approaches have been suggested for estimating the educational or
practical significance of an observed difference. The Joint
Dissemination Review Panel (JDRP) suggests an approximate rule of thumb
for most education measures. If the difference between two groups is
larger than 1/3 of a standard deviation, the difference can be considered
to be educationally significant (JDRP, 1977). Others have suggested



This way of depicting differences between groups has been
referred to as an "effect size" (ES) by Glass (1977). Computationally,
it is derived by the following equation:

$$ES = \frac{\bar{X}_T - \bar{X}_C}{SD_C} \tag{3}$$

In the remainder of this section, effect sizes will always be computed
with the standard deviation of scores using individuals as the unit of
analysis.







Table 29

Summary of Three-Way ANOVA of Treatment Conditions
on RBWS Scores for All Students

Source                       SS      df        MS         F

Trained Students            62.05     1       62.05      1.74
Trained Teachers           124.17     1      124.17      3.48*
Reinforced Students        598.90     1      598.90     16.77**
TS X TT                    154.79     1      154.79      4.34*
TS X RE                     28.76     1       28.76       .81
TT X RE                     37.93     1       37.93      1.06
TS X TT X RE                10.92     1       10.92       .31
Error                      571.29    16       35.71
Total                     1588.80    23

 *p < .10.
**p < .001.
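
Each F in Table 29 is the effect's mean square divided by the error mean square. The sketch below recomputes the ratios from the rounded table entries; because the inputs are rounded, a recomputed value can differ from the printed one in the last digit.

```python
# Mean squares from Table 29 (each effect has 1 degree of freedom).
MS_ERROR = 35.71

mean_squares = {
    "Trained Students": 62.05,
    "Trained Teachers": 124.17,
    "Reinforced Students": 598.90,
    "TS X TT": 154.79,
    "TS X RE": 28.76,
    "TT X RE": 37.93,
    "TS X TT X RE": 10.92,
}

# F ratio for each source: MS(effect) / MS(error).
f_ratios = {source: ms / MS_ERROR for source, ms in mean_squares.items()}

for source, f_value in f_ratios.items():
    print(f"{source:<22}{f_value:6.2f}")
```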






measures such as ω² (omega squared) as an indication of how much
variance in the dependent measure is accounted for by a particular
independent variable (Hayes, 1973). The computation of ω²

$$\omega^2 = \frac{SS_{between} - (J-1)\,MS_{error}}{MS_{error} + SS_{total}} \tag{4}$$

is an effort to further quantify the strength of the relationship between
an independent and a dependent variable. Although Glass and Hakstian
(1969) showed how the interpretation of ω² can be problematic in
certain situations because it may underestimate the importance of the
relationship, ω² does provide additional information which can be used
in determining the importance of differences detected with Analysis of
Variance techniques.

As noted in Table 30, the standard deviation for RBWS when students
are the unit of analysis is 21.49. Thus, the effect size of RE is .46
([87.53 - 77.54] / 21.49 = .46) or almost one half a standard deviation
unit. Estimating the strength of the relationship using ω² indicates
that whether or not students were reinforced accounts for 34.7% of the
variance in their test scores. The effect size for the trained teacher
(TT) condition is .21 or approximately 1/4 of a standard deviation, and
computation of ω² indicates that this factor is accounting for
approximately 5.4% of the variance. The TS condition has an effect size
of -.15 and accounts for only 1.6% ([62.05 - 35.71] / [35.71 + 1588.8])
of the variance in students' test scores.
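
The effect sizes and ω² values in this paragraph can be reproduced from the figures in Tables 29 and 30. This is a sketch only: the function names are hypothetical, and the effect size uses the overall student-level standard deviation (21.49) as the report does in this section.

```python
def effect_size(mean_treatment, mean_nontreatment, sd):
    """Standardized mean difference between treatment and nontreatment."""
    return (mean_treatment - mean_nontreatment) / sd


def omega_squared(ss_between, df_between, ms_error, ss_total):
    """Proportion of variance accounted for by a factor."""
    return (ss_between - df_between * ms_error) / (ms_error + ss_total)


SD_STUDENTS = 21.49   # SD of RBWS with students as the unit of analysis
MS_ERROR = 35.71      # error mean square from Table 29
SS_TOTAL = 1588.80    # total sum of squares from Table 29

# (treatment mean, nontreatment mean, SS between) for each factor.
factors = {
    "RE": (87.53, 77.54, 598.90),
    "TT": (84.82, 80.26, 124.17),
    "TS": (80.93, 84.14, 62.05),
}

results = {}
for name, (m_treat, m_non, ss) in factors.items():
    es = effect_size(m_treat, m_non, SD_STUDENTS)
    w2 = omega_squared(ss, 1, MS_ERROR, SS_TOTAL)
    results[name] = (es, w2)
    print(f"{name}: ES = {es:5.2f}, omega squared = {100 * w2:4.1f}%")
```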

Further interpretation of the magnitude of the .46 (RE) and .21 (TT)
effect sizes results from examining the range and skewness (-.62) of the
frequency distribution of the test scores (RBWS). An 85 point range of



Table 30

Mean RBWS Scores and Standard Deviations for All Students
in Each Treatment Condition

Trained Students (TS): X̄ = 80.93, SD = 9.10

  Condition    Classroom means (SD)                                           X̄      SD
  TT,  RE      (01) 97.87 (10.78)  (12) 83.90 (25.75)  (21) 84.29 (18.08)   88.69    7.96
  TT,  -RE     (05) 70.41 (20.12)  (09) 76.39 (22.34)  (11) 71.13 (22.21)   72.64    3.26
  -TT, RE      (07) 96.56 (13.29)  (15) 76.94 (14.82)  (24) 82.55 (25.08)   85.35   10.10
  -TT, -RE     (04) 70.26 (26.16)  (20) 80.31 (22.88)  (23) 80.54 (?9.01)   77.04    5.87

Untrained Students (-TS): X̄ = 84.14, SD = 7.48

  Condition    Classroom means (SD)                                           X̄      SD
  TT,  RE      (08) 96.27 (10.92)  (19) 93.79 (15.35)  (22) 90.27 (17.45)   93.44    3.02
  TT,  -RE     (14) 80.52 (20.78)  (17) 88.58 (17.97)  (18) 84.33 (22.70)   84.48    4.03
  -TT, RE      (02) 83.71 (21.27)  (03) 76.47 (22.87)  (13) 87.77 (23.31)   82.65    5.72
  -TT, -RE     (06) 73.36 (22.11)  (10) 73.91 (22.27)  (16) 80.77 (16.90)   76.01    4.13

Using classes as the Unit of Analysis:

  All classes   X̄ = 82.54   SD = 8.31
  RE            X̄ = 87.53   SD = 7.44
  -RE           X̄ = 77.54   SD = 5.88
  TT            X̄ = 84.82   SD = 9.10
  -TT           X̄ = 80.26   SD = 7.09

Using students instead of classes as the Unit of Analysis:

  X̄ = 82.59   SD = 21.49

NOTE: Numbers before each cell entry identify the classroom. Numbers in
parentheses after each entry are the standard deviations for each
classroom computed by using the individual student as the Unit of
Analysis.






scores from a 113 maximum indicates a large variance and a high
probability of a small effect size when comparing classroom means with
the individual student standard deviation. The negative skewness occurs
because the median test score for all students was 88.15 with the bottom
50% of the students scoring from 28 to 88 (61 points) and the top 50%
scoring from 89 to 112 (24 points). The fact that half of the students
received scores in only 28% of the range and 20% of the students scored
within 10 points (from 103 to 113) of the maximum possible (113)
suggests a ceiling effect on the test results. Consequently, although
substantial increases are attributable to these two treatment
conditions, the full impact of the reinforcement and trained teacher
conditions may not have been demonstrated because the top students were
obtaining scores near the maximum.
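
The range arithmetic in this paragraph can be checked directly. The sketch assumes that each span is counted inclusively (28 through 88 is 61 score points), which is the reading that reproduces the reported 61, 24, and 85.

```python
def span(lowest, highest):
    """Number of score points from lowest to highest, counted inclusively."""
    return highest - lowest + 1


bottom_half = span(28, 88)    # scores of the lower 50% of students
top_half = span(89, 112)      # scores of the upper 50% of students
full_range = span(28, 112)    # all observed scores

# Share of the observed range occupied by the top half of the students.
top_share = top_half / full_range

print(bottom_half, top_half, full_range)
print(f"{100 * top_share:.0f}% of the range holds the top half")
```

Half of the students compressed into roughly a quarter of the range is the pattern the text interprets as a ceiling effect.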

The mean raw scores for RB and WS, reported in Table 31, show a
difference between RE treatment and nontreatment groups of 4.5 for RB and
5.5 for WS. Translating the raw scores to percentiles (Table 31), the
difference under RE in percentile rank is 8 points (54 - 46) for RB and
14 points (62 - 48) for WS. The difference between RE treatment and
nontreatment groups in grade equivalence is over a half a year for WS
(3.6 - 2.8 = .8).

Taken together, the information yielded by calculating the effect
size and ω² for each of the main effects indicates that whether or not
students are reinforced is educationally as well as statistically
significant in accounting for the results of group administered
standardized achievement tests. Motivating students to try hard on the
test seems to be particularly important. Training test administrators
appears to be a moderately important variable, while training students
(at least as it was done in this study) does not appear to be important.




Table 31

Raw Score, Percentile, and Grade Equivalent for Mean RB
and WS Scores for Treatment and Non-Treatment Groups
Under RE and TT Conditions

                                  RB                          WS
                       Treatment  Non-Treatment    Treatment  Non-Treatment
RE
  Raw Score              35.49       30.99           52.04       46.56
  Percentile             54          46              62          48
  Grade Equivalent        2.9         2.7             3.6         2.8

TT
  Raw Score              34.59       31.88           50.21       48.28
  Percentile             54          48              56          52
  Grade Equivalent        2.8         2.7             3.3         3.7






Given the way that students were randomly assigned to treatment
conditions, these results suggest that the motivational level of the
student and the conditions of test administration are causally related to
the scores obtained by the student.

Although not statistically significant, untrained students did
receive higher RBWS scores (X̄ = 84.14) than trained students (X̄ = 80.93)
(Table 30). This can be interpreted to mean that differences this large
or larger would be obtained more than 10 times out of 100 if two samples
of this size were randomly drawn from the same population. This evidence
suggests that training students with the type of test-taking package
described in this study will not influence test scores of second grade
students from Title I classrooms in a metropolitan area.

A statistically significant two-way interaction (TT by TS) was
found, F (1,16) = 4.34, p < .10 (Table 29). These results indicate that
the effect of training students in test-wiseness is influenced by the
status of the test administrator (whether trained or not). Similarly, the
results of training test administrators are influenced by the degree of
test taking training provided to the students.

Graphing the interaction (see Figure 1), it can be seen that trained
students had higher RBWS scores when the test was administered by
untrained teachers (X̄ = 81.2) than when the test was administered by
trained teachers (X̄ = 80.7). Conversely, untrained students scored
higher under trained teachers (X̄ = 89.0) than under untrained teachers
(X̄ = 79.3). These results indicate that the training provided to
students was not an effective agent in increasing scores and may even
have had a slight detrimental effect when coupled with the effect of
having a trained teacher administer the test. That is, trained teachers







Figure 1

The graph illustrates the interaction of TT X TS for all students.
Numbers in the graph are the mean RBWS scores obtained by classrooms
under specified conditions. (Graph labels: TRAINED TEACHER, UNTRAINED
TEACHER, TRAINED STUDENT.)







were more influential in raising scores of the untrained students than of
the trained students.
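
The four interaction means can be recovered by averaging the reinforced and unreinforced cell means of Table 30 within each TS by TT combination. This is a sketch under the assumption that each interaction mean is the unweighted average of its two cell means; the two cell means for trained students with untrained teachers are taken as 85.35 and 77.04, the values consistent with the classroom scores listed in those cells.

```python
from statistics import mean

# Cell means from Table 30, keyed by
# (students trained, teachers trained, students reinforced).
cell_means = {
    (True, True, True): 88.69,
    (True, True, False): 72.64,
    (True, False, True): 85.35,
    (True, False, False): 77.04,
    (False, True, True): 93.44,
    (False, True, False): 84.48,
    (False, False, True): 82.65,
    (False, False, False): 76.01,
}


def ts_by_tt_mean(students_trained, teachers_trained):
    """Average the RE and -RE cell means for one TS x TT combination."""
    return mean(
        value
        for (ts, tt, _re), value in cell_means.items()
        if ts == students_trained and tt == teachers_trained
    )


for ts in (True, False):
    trained = ts_by_tt_mean(ts, True)
    untrained = ts_by_tt_mean(ts, False)
    label = "trained students" if ts else "untrained students"
    # Difference attributable to teacher training within this student group.
    print(f"{label}: {trained:.1f} vs {untrained:.1f} "
          f"(difference {trained - untrained:+.1f})")
```

The near-zero difference for trained students and the large positive difference for untrained students is the pattern the interaction describes.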

One possible explanation for the interaction is that trained
students were too confident about testing skills and did not attend well
to the actual testing task. Another possible explanation for the lower
scores of trained students is that training was not effective and even
confused the students because the trainer and the trained test
administrator were strangers, and the untrained test administrator was
their regular classroom teacher. These findings suggest that higher test
scores may be obtained if the test administrator is familiar to the
students, is trained in appropriate testing procedures, and provides
students with reinforcement (motivation) for trying to do well.

Pretest/no pretest students. Analysis of covariance (covarying on a
pretest taken during Fall, 1979) was originally planned for this study to
improve the statistical power of the design. Scores on the pretest were
available for 428 of the 597 students in the study; consequently, the use
of ANCOVA would have resulted in only 72% (428/597) of the students being
included in the analysis. To support the use of ANCOVA, students with
the pretest scores should have been representative of students without
pretest scores. A breakdown of test scores (Table 32) was prepared to
assess whether posttest scores for students with pretest scores were any
different from the posttest scores of students for whom pretest scores
were not available.

The posttest means and standard deviations for the two groups are
reported for RB, WS, and RBWS for each treatment condition. In every
condition, the group for whom pretest scores were available had higher






Table 32

Mean Test Scores and Standard Deviations for
Students With Pretest Scores and
Students With No Pretest Scores

                             Number               RB                 WS                RBWS
Treatment Groups        Pretest  No Pre.    Pretest  No Pre.   Pretest  No Pre.   Pretest  No Pre.

Reinforced Students       207      70        36.10    34.81     52.43    52.36     88.53    87.17
                                            (10.5)   (10.2)    (10.35)  (10.6)    (19.6)   (19.6)

Unreinforced Students     221      71        32.34    26.24     47.50    43.27     79.84    69.51
                                            (10.7)   (13.3)    (10.8)   (12.3)    (20.3)   (24.5)

Trained Teachers          223      72        35.32    32.79     50.70    49.03     86.03    81.82
                                            (10.3)   (11.3)    (11.0)   (11.7)    (20.2)   (??.9)

Untrained Teachers        205      69        32.90    28.10     48.99    46.48     81.89    74.82
                                            (11.2)   (13.5)    (10.7)   (12.9)    (20.5)   (25.3)

Trained Students          204      68        33.06    29.51     49.55    45.99     82.61    75.50
                                            (10.8)   (1?.3)    (11.0)   (13.0)    (20.?)   (25.1)

Untrained Students        224      73        35.17    31.41     50.19    49.45     85.35    80.86
                                            (10.7)   (11.9)    (10.7)   (11.5)    (20.1)   (22.5)

All Groups                428     141        34.16    30.50     49.89    47.78     84.04    78.28
                                            (10.7)   (12.6)    (10.9)   (12.3)    (20.4)   (23.5)




mean posttest scores than the group for whom pretest scores were not
available. The mean RBWS (posttest) score for all treatment conditions
was 84.04 for the pretest group and 78.28 for the no pretest group. The
posttest score difference between the pretest group and the no pretest
group was statistically significant, t (595) = 2.77, p < .01.

Due to the posttest score differential between the pretest and the
no pretest group, it was concluded that students for whom pretest scores
were available were not representative of the entire sample. Therefore,
using ANCOVA would have biased the results and was deemed inappropriate
for the study.

Teacher Behavior 

Observational data of teacher on-task behavior during testing were
collected during the Reading B and Word Study subtests. A three-way
analysis of variance was used to analyze the data across treatment
conditions. Results presented in Table 33 show a statistically
significant main effect for trained teachers, F (1,16) = 36.34, p < .001.

As indicated in Table 34, the mean percent of on-task behavior for
trained teachers (X̄ = 72.9) was statistically significantly higher than
for untrained teachers (X̄ = 25.5). Although not statistically
significant, means for the other treatment groups (X̄_TS = 50.7;
X̄_RE = 54.9) were higher than for the non-treatment groups (X̄_-TS =
47.7; X̄_-RE = 43.5). Individual teacher percentages of on-task
behavior ranged from 0% to 100%.







Table 33

Summary of Three-Way ANOVA of Treatment Conditions on
Observations of Teacher Behavior During RBWS

Source                        SS       df         MS          F

Trained Students             31.28      1        31.28        .09
Trained Teachers          13357.60      1     13357.60      36.34*
Reinforced Students         704.17      1       704.17       1.92
TS X TT                       1.08      1         1.08        .00
TS X RE                      13.95      1        13.95        .04
TT X RE                      85.50      1        85.50        .23
TS X TT X RE                  2.41      1         2.41        .01
Error                      5880.49     16       367.53
Total                     20076.49     23

*p < .001






Table 34

Mean Percent of Teacher On-task Behavior During RB and WS
By Treatment Condition

                                 Trained Teachers                 Untrained Teachers
                           Classrooms         X̄     SD      Classrooms         X̄     SD

Trained Students (X̄ = 50.7, SD = 29.4)
  Reinforced Students     98.8  68.1  70.1   79.0   17.2    23.7  60.1  24.2   36.0   20.9
  Unreinforced Students   82.9  67.3  58.0   69.4   12.6    35.4   8.4  11.0   18.3   14.9

Untrained Students (X̄ = 47.7, SD = 31.4)
  Reinforced Students     85.8  52.1  85.9   74.6   19.5    12.4  60.6  17.4   30.1   25.5
  Unreinforced Students   81.3  70.6  53.5   68.5   14.0     2.0   5.0  46.0   17.7   24.6

Trained Teachers:     X̄ = 72.9   SD = 14.4
Untrained Teachers:   X̄ = 25.5   SD = 20.6

Using individual observations as the Unit of Analysis:

  X̄ = 51.7   SD = 29.66




Using individual observations as the unit of analysis, the overall
standard deviation of teacher on-task behavior is 29.66. The effect size
(Equation 3) of TT is 1.6. The strength of the relationship between
trained and untrained teachers (estimated by computing ω²,
Equation 4) indicates that 63.5% of the variance in teacher on-task
behavior is accounted for by whether or not teachers were trained.
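
As a check on the two figures above, the following minimal Python sketch recomputes them from the values in Tables 33 and 34. Equations 3 and 4 are defined in an earlier chapter that is not reproduced here, so the formulas below are assumptions: the effect size is taken to be the difference between group means divided by the overall standard deviation, and ω² is taken in its usual fixed-effects form, (SS_effect - df_effect x MS_error) / (SS_total + MS_error). The function names are illustrative only.

```python
def effect_size(mean_treated, mean_untreated, overall_sd):
    """Standardized mean difference (assumed form of Equation 3)."""
    return (mean_treated - mean_untreated) / overall_sd


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Proportion of variance accounted for (assumed form of Equation 4)."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Teacher on-task behavior, trained vs. untrained teachers (Tables 33 and 34).
es_tt = effect_size(72.9, 25.5, 29.66)
w2_tt = omega_squared(13357.60, 1, 367.53, 20076.49)

print(f"effect size of TT: {es_tt:.1f}")
print(f"omega squared for TT: {w2_tt:.3f}")
```

Under these assumptions the sketch gives values that round to the 1.6 and 63.5% reported in the text.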

Educational significance is a bit more difficult to establish here
since the data collection instrument is not a normed test with which we
have broad experience. However, these findings indicate that teachers
who were trained in appropriate test administration techniques were
demonstrating those skills substantially more frequently than untrained
teachers.

Student Behavior 

Data for student on-task behavior were analyzed by ANOVA across
treatments, and the findings presented in Table 35 show no statistically
significant main effects or interactions. Differences this large would
be obtained between two groups more than 10 times in 100 if samples of
this size were randomly drawn from the same population.

Based on these findings, it appears that treatment conditions (RE,
TS, and TT) did not influence the degree of on-task behavior displayed by
the lowest five achievers in each classroom. Means presented in Table 36
show very little difference between treatment and non-treatment groups
for each factor. In order, treatment and non-treatment on-task means
were 72.9 and 73.9 (TS), 75.6 and 71.2 (RE), and 75.5 and 71.2 (TT).
Individual student percentages of on-task behavior ranged from 35% to
100%.
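
The F ratios in Table 35 can be recomputed from the sums of squares and degrees of freedom. The Python sketch below does this and compares each ratio with 3.05, the approximate tabled critical value of F(1, 16) at the .10 level. That critical value comes from a standard F table and is an assumption here, not a figure stated in the report; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom as listed in Table 35.
effects = {
    "Trained Students": (5.70, 1),
    "Trained Teachers": (110.08, 1),
    "Reinforced Students": (123.31, 1),
    "TS X TT": (25.01, 1),
    "TS X RE": (14.57, 1),
    "TT X RE": (181.50, 1),
    "TS X TT X RE": (124.67, 1),
}
ss_error, df_error = 1271.30, 16

# Approximate tabled critical value of F(1, 16) at alpha = .10
# (taken from a standard F table; an assumption, not a value from the report).
F_CRIT_10 = 3.05

# Each F ratio is the effect mean square divided by the error mean square.
ms_error = ss_error / df_error
f_ratios = {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}

for name, f in f_ratios.items():
    verdict = "exceeds" if f > F_CRIT_10 else "is below"
    print(f"{name:20s} F = {f:.2f} ({verdict} {F_CRIT_10})")
```

Every ratio falls below the assumed critical value, which is consistent with the statement that differences this large would occur more than 10 times in 100 by chance.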



Table 35

Summary of Three-Way ANOVA of Treatment Conditions on
Observations of Student Behavior During RBWS

Source                        SS      df          MS         F

Trained Students            5.70       1        5.70       .07
Trained Teachers          110.08       1      110.08      1.39
Reinforced Students       123.31       1      123.31      1.55
TS X TT                    25.01       1       25.01       .32
TS X RE                    14.57       1       14.57       .18
TT X RE                   181.50       1      181.50      2.28
TS X TT X RE              124.67       1      124.67      1.57
Error                    1271.30      16       79.47
Total                    1856.14      23



Table 36

Mean Percent of Student On-Task Behavior
During RB and WS
By Treatment Condition

                                Trained Teachers       Untrained Teachers

Trained Students
(X_TS = 72.9, SD_TS = 10.4)

  Reinforced Students           73.9  76.3  72.0       73.1  95.4  64.8
                                X = 74.1, SD = 2.2     X = 77.8, SD = 15.8

  Unreinforced Students         68.8  80.4  84.8       59.6  61.5  63.8
                                X = 78.0, SD = 8.3     X = 61.6, SD = 2.1

Untrained Students
(X_-TS = 73.9, SD_-TS = 7.8)

  Reinforced Students           74.7  81.7  71.5       59.8  86.4  77.8
                                X = 76.0, SD = 5.2     X = 74.7, SD = 13.6

  Unreinforced Students         73.4  81.7  66.8       65.1  77.7  69.6
                                X = 74.0, SD = 7.5     X = 70.8, SD = 6.4

All Students                    X_TT = 75.5            X_-TT = 71.2
                                SD_TT = 5.6            SD_-TT = 11.3

Reinforced Students:    X_RE = 75.6 (SD_RE illegible)
Unreinforced Students:  X_-RE = 71.2, SD_-RE = 8.8

Note: The three numbers in each cell represent the mean percent student
on-task behavior per classroom.
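
Because every cell in Table 36 contains three classrooms, each marginal mean is the simple average of the four cell means at that level of the factor. The Python sketch below recomputes the marginal means from the cell means listed in the table. The dictionary layout and function name are illustrative, and differences of about 0.1 from the reported values can arise because the cell means are themselves rounded.

```python
from statistics import mean

# Cell means from Table 36, keyed by (students trained, teachers trained,
# students reinforced). True marks the treatment level of each factor.
cell_means = {
    (True, True, True): 74.1,
    (True, False, True): 77.8,
    (True, True, False): 78.0,
    (True, False, False): 61.6,
    (False, True, True): 76.0,
    (False, False, True): 74.7,
    (False, True, False): 74.0,
    (False, False, False): 70.8,
}

# Position of each factor within the key tuple.
FACTORS = {"TS": 0, "TT": 1, "RE": 2}


def marginal_mean(factor, level):
    """Average the cell means for one level of one factor (equal cell sizes)."""
    idx = FACTORS[factor]
    return mean(v for key, v in cell_means.items() if key[idx] == level)


for factor in FACTORS:
    treated = marginal_mean(factor, True)
    untreated = marginal_mean(factor, False)
    print(f"{factor}: treatment {treated:.1f}, non-treatment {untreated:.1f}")
```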






Test Booklets 

Student test booklets were examined for invalid marks made by
students that would influence the machine-scored results. Data were
analyzed by error type across treatments using ANOVA. Statistically
significant differences were observed for the number of erasures made by
observers and items left blank (not answered) by the students. Erasures
were defined as the removal of any mark on the test booklet that was not
a part of an answer fill-in. These marks may have been read as answers
during machine scoring, so they were erased by observers. Blank items
were defined as questions with no answer filled in by the student.
Erasures and blank items were entered into the analyses by mean number
per booklet per classroom.

Results from ANOVA on erasures (Table 37) show a statistically
significant main effect for trained students, F(1, 16) = 7.51, p < .02
(ω² = .227). The mean number of erasures per trained student (Table
38) was 13.66 (SD = 5.12) and per untrained student, 35.43 (SD = 25.81),
indicating that untrained students made significantly more marks that
would invalidate the results from machine scoring than trained students.
Due to the emphasis during student training on filling out machine
scorable answer forms, a large difference would be expected in the number
of erasures needed by booklets from untrained as opposed to trained
classrooms. This evidence suggests that part of the student training
(answer format) was successfully communicated but was apparently
unrelated to student scores because of the careful way in which booklets
were corrected before scoring.

Table 39 presents the results for ANOVA on the number of items left
blank per student. A statistically significant main effect was found for
TT (F[1, 16] = 9.79, p < .01). The estimate of ω² indicates that 23.6%

Table 37

Summary of Three-Way ANOVA of Treatment Conditions on
the Number of Marks That Required Erasing Before
Test Booklets Were Machine Scored

Source                        SS      df          MS         F

Trained Students         2845.08       1     2845.08      7.51*
Trained Teachers          620.34       1      620.34      1.64
Reinforced Students        97.98       1       97.98       .26
TS X TT                   739.74       1      739.74      1.95
TS X RE                    34.56       1       34.56       .09
TT X RE                     1.41       1        1.41       .00
TS X TT X RE               58.78       1       58.78       .16
Error                    6064.70      16      379.04
Total                   10462.60      23

*p < .02.



Table 38

Means and Standard Deviations for Number of
Marks that Required Erasing Before Test
Booklets Were Machine Scored

                        X        SD

+TS Classrooms        13.66     5.12
-TS Classrooms        35.43    25.81
All Classrooms        24.55    21.33



Table 39

Summary of Three-Way ANOVA of Treatment Conditions
on the Number of Test Items Left Blank by the Students

Source                        SS      df          MS         F

Trained Students            5.81       1        5.81       .73
Trained Teachers           77.81       1       77.81      9.79**
Reinforced Students         4.34       1        4.34       .55
TS X TT                      .22       1         .22       .03
TS X RE                     3.93       1        3.93       .49
TT X RE                     9.83       1        9.83      1.24
TS X TT X RE               58.37       1       58.37      7.34*
Error                     127.20      16        7.95
Total                     287.51      23

*p < .02.
**p < .01.

of the variance in blank items was accounted for by training teachers.
Means listed in Table 40 indicate that a difference of 3.61 items per
student distinguishes trained teacher and untrained teacher conditions.
These results provide some evidence as to the effectiveness of the teacher
training package in communicating the importance of answering as many
questions as possible.

Summary 

Although not originally included as a part of the "State Refinements"
workscope, and paid for largely out of other sources, this component of the
project provides important information about the questions the project was
designed to address. More specifically, the project was responding to
concerns among SEA and LEA personnel that the results of Title I evaluation
since the implementation of TIERS have seemed inconsistent with historical
estimates and more variable from year to year in the same program than seemed
reasonable. This component of the project identified and presented empirical
evidence about three factors which may be partially responsible for the
discrepancies in Title I evaluation results. Although somewhat different from
the factors the project was originally designed to address (i.e., the validity
of Model A and the degree to which assumptions underlying Model A are being
violated in Utah schools), these factors are nevertheless important because
they affect the results of all the evaluation models and, in fact, can
influence the results of any evaluation which depends on the administration of
standardized tests.

The results of this component of the project present convincing evidence
that the way in which a standardized test is administered and the degree to
which students are motivated to do well on the test are substantially related
to the scores that students receive. Since test scores are frequently
interpreted as an indicator of what students know, these data indicate that
other factors besides knowledge are playing a significant role in determining
students' scores. Consequently, students' scores on standardized achievement
tests must be interpreted cautiously and with reference to such factors as
motivation and proper test administration procedures.

Table 40

Means and Standard Deviations for Number
of Test Items Left Blank

                        X        SD

+TT Classrooms         3.86     1.86
-TT Classrooms         7.47     3.95
All Classrooms         5.67     3.54

CHAPTER VII 



SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS

The previous chapters of this report have described the procedures and
results of a project undertaken by the Utah State Office of Education with
support from the United States Department of Education to make "State
Refinements to the ESEA Title I Evaluation and Reporting System." Funding
for the project was awarded under competitive bidding procedures in response
to an RFP.

The project was motivated primarily by a perception among LEA and SEA
staff responsible for Title I programs that the results of Title I evaluations
in Utah since implementation of the Title I Evaluation and Reporting System
(TIERS) appeared to be inconsistent with previous Title I results and, at
times, inconsistent for a given project from year to year. These perceptions
motivated the question which this project was designed to answer: Are Title I
evaluation results obtained with Model A of the TIERS accurate and believable?

The project's workscope was designed to provide information about this
basic question and consisted of three parts: (a) an extensive review of the
literature which has considered, both empirically and philosophically, the
validity of TIERS evaluation models; (b) an empirical comparison of the
estimated impact of a single Title I program using both Models A and B (since
Model B is assumed to be the more rigorous of the TIERS models, if both A and
B were properly implemented and yielded different estimates of program impact,
the results could most plausibly be attributed to weaknesses in Model A); and
(c) an investigation of the degree to which assumptions of Model A are being
violated during implementation of Title I programs in Utah.

As the original workscope was implemented, two additional components
were added. Funding for these components came from contributed resources from
the Utah State Office of Education and Utah State University; a small grant
from the Office of Special Education; and more efficient uses of the resources
resulting from the contract with the Department of Education. These
additional components included (a) an investigation of the effect of item
format on students' scores on standardized achievement tests and (b) the
effects on standardized achievement test performance of training
teachers, training students, and motivating students.

The review of previous literature about the Title I Evaluation and
Reporting System was conducted primarily to establish a foundation on which
the other components of the project could build. The remainder of this
chapter summarizes the major findings and recommendations resulting from the
project. Additional detail regarding each of the project components is
provided in previous chapters.

A Comparison of TIERS Models A and B

Findings. Not only in Utah, but nationwide, Model A is the most
frequently used Title I evaluation model and the one which is most reasonable
for LEAs to implement. However, the results of this study indicate that Model
A, even when it is correctly implemented, appears to overestimate the impact
of a Title I program. Depending upon the grade level and content area, this
overestimation ranges from one-sixth of a standard deviation to more than
one-half of a standard deviation. Such differences in estimated impact can
hardly be considered trivial.

As important, however, are the data which suggest that either Model A or
Model B can be implemented correctly but in different ways and yield very
different estimates of impact for the same project. These differences within
models occur primarily because, within a given model, the pool of students
on which impact is being estimated may include substantially different
students.

The two different selection methods used for Model A in this study
resulted in 50% to 75% more students being included in the second selection
method, although both were appropriate. The three different methods used to
select students for Model B, although all technically correct, also resulted
in different groups of students being considered. Especially where mobility
and attrition are high, as would frequently be the case in Title I programs,
the differences in these implementation methods for either Model A or Model B
can result in very different estimates of program impact for a particular
Title I program.

The point has been made that attrition is not a major concern with Model
A because data are obtained for the students who are still in the program at
the end of the year. However, it is important to note that the TIERS models
result in conclusions about impact of a total program and not about the impact
of a program on a given individual student. Depending on how the evaluation
model is implemented, the students for whom data are available may be very
different students from the students for whom data are not available. Such
differences in the student population being considered in the evaluation data
may lead to very different conclusions about program effectiveness. If
evaluation data are to be used to make programmatic decisions about
continuation and/or improvement, such differences are very important and
cannot be ignored in the decision making process.

Recommendation^ ! The findings of this component of the project^ place the 
^- • / * ^ " 

State Office of Education in somewhat 'of the dilemma. Although the results • 
indicate that Model A may be overestimating the impact of Title I programs, it 



• • ' » 

is unreasonable to expect that most school districts in the state will be in 

* situations*that would allow them to use Model B or 'Moc^lc. At this point in 

time v/hrtually all of the districts are optinq for Model A. 'Rather than have 

some districts using Model A, some Model B, ancJ spme Mocfel C, when there is' 

evidence to sugqest that data from the three models are not comparable it x . 

would make more sense for the state to encourage all districts to use model A. 

Although this may lgad to some overestimation of project impact, at least data 

from projects will be comparable from one project to another. 

In addition, the State Office should c|^nue to work with LEAs and 

inlist the support of the Technical Assistance Center in. working with 

districts as they implement Model A to ensure that the methods used for 

including students data in the evaluation results are explicit and well 
* * 

defined. This recommendation focuses on a different issue than the formal 
procedures' and objective tests used to Select students for participation in 
Title % I proqrams. The results of this project suggest that even after 
students have been properly selected"for participation in the Ti\le I proqram 
mobilit/, attrition, and absenteeism, contribute to difficulties in includinq 
many students' results in TIERS. Since the results of TIERS are used to make 
*^tatements about' the total program, it ifc >n#ort£it for districts to have as 
many children* who participated Jn the Title I program as goss'ible included in 
the reportinq of the evaluation results. 

Degree to Which Assumptions Made by Model A are Met in Utah Title I
Evaluations

Findings. Data from this component of the project have indicated that
most of the mechanical assumptions of Model A (e.g., separation of selection
and pretest, testing near the empirical norming date, and using appropriate
levels of the test) are adhered to reasonably well by most districts. The
most serious problem was the degree to which appropriate selection measures
were used for those students who moved into the districts after the majority
of the Title I students had been selected, particularly in districts where the
majority of the selection occurred during the Spring. It appeared that a
substantial number of students may be selected for Title I programs in ways
different from the Spring selection procedures that may violate the guideline
of separating the selection test and the pretest. Furthermore, it was
disturbing that many Title I directors and other LEA Title I personnel did not
have a clear understanding of TIERS requirements for the model they were using.

Although most districts appear to be following the mechanical assumptions
of Model A reasonably well, there are other assumptions which are somewhat
more subtle, but in many ways more important, that appear to be violated
frequently. Generally accepted test administration practices are frequently
not followed, and factors such as the match between the instructional emphasis
and the emphasis of the standardized test used by the district are frequently
sources of difficulty. Consequently, even though the TIERS model may be
implemented "correctly," the results of the evaluation regarding the impact of
Title I programs may be difficult to interpret.

Recommendations. The importance of following the guidelines for
implementing Model A should be continually emphasized. More importantly,
however, the State Office and district Title I directors should focus
additional attention on following standardized test administration procedures
and selecting tests which emphasize the same factors being emphasized in the
Title I instruction. Regardless of which Title I evaluation model is utilized,
these factors could substantially impact the results.



The Effect of Item Format on Students' Scores from Standardized
Achievement Testing

Findings. An analysis of standardized achievement tests frequently used
in conjunction with Title I evaluations revealed that different tests use
different types of items to assess students' mastery of the same content. In
two studies where groups of Title I students were asked to answer questions
about identical content using items that had been written in the various
formats from frequently used standardized achievement tests, it was found that
the type of format used to ask the question accounted for almost
three-quarters of a standard deviation difference in students' scores.

These data suggest that what a student appears to know based on the
results of a standardized achievement test may be influenced heavily by the
particular format used by that test in addition to what the student really
does know about the content. The reason for differences between types of
format was not addressed specifically in this study, but it may well be that
students have greater difficulty with formats with which they are unfamiliar.
Consequently, not only is the match between instructional emphasis and the
emphasis of the standardized tests important, but it is also important that
the students be familiar with the types of formats in which questions will be
asked.

Recommendations. The State Office should continue to investigate the
effect of item format on standardized testing results and emphasize to Title I
personnel the importance of making sure that students are familiar with the
formats that will be used in the particular standardized achievement tests
used in their district. The greater the extent to which factors (other than
the students' knowledge of the content being tested) can be controlled and/or
eliminated from the standardized testing, the more valid and useful results of
Title I evaluations will be. Currently, it is difficult to know whether a
student's low score is a result of not knowing the content being tested or
results from the student's unfamiliarity with the particular format being used
to test the content.

Effects on Standardized Test Performance of Training Teachers, Training
Students, and Motivating Students

Findings. Using a true experimental design, 24 classrooms of students in
Title I schools were randomly assigned to experimental and control conditions
on three factors: (a) training teachers in appropriate standardized testing
procedures, (b) training students in test-taking skills, and (c) motivating
students to do their best on standardized tests. The results of this study
indicate that students who are tested by trained administrators and/or who are
motivated to do their best on the test do substantially better than students
who are not. Coupled with the information from the project's on-site visits,
which suggested that procedures for standardized test administration in
Title I evaluations were frequently violated, these data suggest that test
administration techniques and student motivation variables may be confounding
the results of Title I evaluations. Students who were motivated to perform
well on the achievement tests scored almost one-half of a standard deviation
above those who were not. Students who were administered the test by trained
administrators scored almost one-quarter of a standard deviation above those
who were not.
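
Effect sizes stated in standard deviation units can be easier to interpret as percentile shifts. The Python sketch below converts the two effect sizes above into the percentile rank that a student starting at the 50th percentile of the comparison group would reach after a shift of that size. It rests on two assumptions that the report does not state: that scores are normally distributed, and that the rounded values of 0.50 and 0.25 can stand in for "almost one-half" and "almost one-quarter." The names used are illustrative.

```python
from statistics import NormalDist

# Effect sizes in standard deviation units, as summarized in the text
# (rounded values; the text says "almost" one-half and one-quarter).
effect_sizes = {
    "motivated students": 0.50,
    "trained test administrators": 0.25,
}


def percentile_after_shift(effect_size_sd):
    """Percentile rank reached by a median student shifted upward by the
    given number of standard deviations, assuming normally distributed scores."""
    return 100 * NormalDist().cdf(effect_size_sd)


for label, es in effect_sizes.items():
    print(f"{label}: about the {percentile_after_shift(es):.0f}th percentile")
```

Read this way, effects of this size move a typical student a noticeable distance up the score distribution, which supports treating them as more than trivial.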

Recommendations. Continued effort needs to be made to assure that those
people responsible for administering standardized achievement tests and Title
I evaluations are properly trained and follow appropriate standardized testing
procedures. Furthermore, efforts need to be made to motivate students to try
their best on the achievement tests. The methods used in this study (i.e.,
paying students for scoring higher on the test than had been predicted from a
pretest) are obviously not appropriate as a standard practice. However, other,
more practical procedures need to be investigated and empirically tested. If
students don't care whether they do well on a test, the results of that test
can hardly be used as a measure of program impact.

Summary

The results of this project raise a number of questions regarding the
interpretations of Title I evaluation results. Some of those questions (e.g.,
the apparent inflated estimates of impact using Model A) apply only to a
particular model, while other concerns (e.g., lack of adherence to standardized
testing procedures, effects of student motivation and item format) cut across
all evaluation models.

Any type of state-wide or national evaluation system is bound to be
complex. The complexities in the TIERS as indicated by this research are of
greater magnitude than many people have assumed and should be considered
carefully in interpreting the results of Title I evaluations. The solution,
obviously, is not to discard all evaluation. Evaluation is important if
determinations are to be made about effectiveness. However, these results do
indicate that we must be more careful in implementing the evaluation models
and in interpreting the results of those models.

Furthermore, any evaluation system which utilizes standardized
achievement testing to draw conclusions about how much students know in a
particular content area must take into consideration the results of this
research. Based on these data, it appears that a number of other factors
(e.g., the way in which the test is administered, the student's level of
motivation, the type of format used by the particular achievement test) are
substantially related to a student's score on an achievement test besides what
a student actually knows. Unless these other factors can be eliminated or
controlled, it is difficult to tell how much of a student's score is a
function of his or her knowledge and how much is a function of these other
factors. Unless this can be determined, evaluation results, regardless of
which Title I evaluation model is used, will be difficult to interpret.

This project has not provided easy or definitive answers to the
questions which originally motivated the study. Instead, it has provided a
variety of data which should make Title I administrators in both SEAs
and LEAs more careful in implementing Title I evaluation and more
cautious in interpreting the results. Most importantly, however, it
has more clearly defined some important questions which need further
investigation if the results of most Title I evaluations are to be
clearly interpretable.






REFERENCES 



Armor, D., Conry-Oseguera, P., Cox, M., King, N., McDonnell, L., Pascal, A.,
Pauly, E., & Zellman, G. Analysis of the school preferred reading
program in selected Los Angeles minority schools. Report prepared for
the Los Angeles Unified School District, R-2007-LAUSD, Rand Corporation,
Santa Monica, CA, 1976. In S. Murray, J. Arter, & B. Faddis, Title I
technical issues as threats to internal validity of experimental and
quasi-experimental designs: annotated bibliography. Paper presented at
the annual meeting of the American Educational Research Association, San
Francisco, April, 1979.

Arter, J. A., & Estes, G. D. A model for developing local norms with a
standardized achievement measure for use with local program evaluation:
procedures and effects. Paper presented at the annual meeting of the
American Educational Research Association, Toronto, Ontario, Canada,
March 27-31, 1978.

Baker, R., & Williams, T. Issues related to interpolation. In Report of the
committee to examine issues related to the use of the norm referenced
model for Title I evaluation. Compiled by J. B. Hansen. Portland,
Oregon: Northwest Regional Educational Laboratory, October, 1978.

Barcikowski, R. S., & Olsen, H. Test item arrangement and adaptation level.
The Journal of Psychology, 1975, 90, 87-93.

Barnes, R. E., & Ginsburg, A. L. The relevance of the RMC models for Title I
policy concerns. Paper presented at the annual meeting of the American
Educational Research Association, Toronto, Ontario, Canada, March 27-31,
1978.

Benson, B., & Crocker, L. The effects of item format and reading ability on
objective test performance: A question of validity. Educational and
Psychological Measurement, 1979, 39, 381-387.

Bridgeman, B. Extrapolation and interpolation in Model A1 Title I evaluation.
In Report of the committee to examine issues related to the use of the
norm referenced model for Title I evaluation. Compiled by J. B. Hansen.
Portland, Oregon: Northwest Regional Educational Laboratory, October,
1978.

Burton, B. Model A1 and the regression effect. Memo to TAC staff, May 8,
1978. In S. Murray, J. Arter, & B. Faddis, Title I technical issues as
threats to internal validity of experimental and quasi-experimental
designs: annotated bibliography. Paper presented at the annual meeting
of the American Educational Research Association, San Francisco, April,
1979.

Carcelli, L., Taylor, C., & White, K. The effect of item format on phonics
subtest scores of standardized reading achievement tests. Paper
presented at the annual meeting of the American Educational Research
Association, 1981, Los Angeles, California.






Catts, R. How many options should a multiple choice question have? (a) 2.
(b) 3. (c) 4. At a glance research report. Sydney, Australia: New
South Wales Department of Education, 1978. (ERIC Document Reproduction
Service No. ED 173 354)

Cochran, W. G. The use of covariance in observational studies. Applied
Statistics, 1969, 18, 270-275. In G. Echternacht & S. Swinton, Getting
straight: everything you always wanted to know about the Title I
regression model and curvilinearity. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
Calif., April 8-12, 1979.

Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M.,
Weinfeld, F. D., & York, R. L. Equality of educational opportunity.
U.S. DHEW, Office of Education, Washington, D.C.: Government Printing
Office, 1966. In S. Murray, J. Arter, & B. Faddis, Title I technical
issues as threats to internal validity of experimental and
quasi-experimental designs. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979.

Conklin, J. The effects of date of testing and method of interpolation on the
use of standardized test scores in the evaluation of large-scale
educational programs. Paper presented at the annual convention of the
American Educational Research Association, San Francisco, April, 1979.

Crane, L. R., & Cech, J. Title I evaluation models A1 and B1: an empirical
comparison. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, April, 1979.

Cronbach, L. J. Test validation. In Thorndike (Ed.), Educational
measurement. Washington, D.C.: American Council on Education, 1971.

Crowder, C. R., & Gallas, E. J. Relation of out-of-level testing to ceiling
and floor effects on third and fifth grade students. Paper presented at
the annual meeting of the American Educational Research Association,
Toronto, Ontario, Canada, March 27-31, 1978.

Dambrot, F. Test item order and academic ability, or should you shuffle the
test item deck? Teaching of Psychology, 1980, 7, 94-96.

David, J. L., & Pelavin, S. H. Evaluating compensatory education: Over what
period of time should achievement be measured? Journal of Educational
Measurement, 1978, 15(2), 91-99.

DeVito, P. J., & Long, J. V. The effects of spring-spring vs. fall-spring
testing upon the evaluation of compensatory education programs. Paper
presented at the annual convention of the American Educational Research
Association, New York City, April, 1977.

Doherty, W. Restandardization study. System Development Corporation
(undated). In S. Murray, J. Arter, & B. Faddis, Title I technical
issues as threats to internal validity of experimental and
quasi-experimental designs. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979.




Echternacht, 6. Model C is feasible for ESEA Title I evaluation. Paper 
presented at thQ annual meeting of the American Educational Research 
Association, Boston, Mass., April 10, 1980. 

Echternacht, G., & Swinton, S. Getting straight: Everything you always
wanted to know about the Title I regression model and curvilinearity.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, Calif., April 8-12, 1979.

Estes, G. D., & Anderson, J. I. Observed treatment effects with special
regression evaluation models in groups with no treatment. Paper
presented at the annual meeting of the American Educational Research
Association, Toronto, Ontario, Canada, March 27-31, 1978.

Estes, W. K., & DaPolito, F. J. Independent variation of information storage
and retrieval processes in paired associate learning. Journal of
Experimental Psychology, 1967, 75, 18-26.

Faddis, B. J., Arter, J. A., & Zwertchek, A. An empirical comparison of ESEA
Title I evaluation models A and B. Paper presented at the annual meeting
of the American Educational Research Association, San Francisco, April
8-12, 1979.

Faddis, B. J., & Estes, G. D. Fall-to-spring vs. fall-to-fall evaluation of a
large Title I program with a comparison group design. Paper presented at
the annual meeting of the American Educational Research Association,
Toronto, 1978.

Fagan, B. M., & Horst, D. P. Selecting a norm-referenced test. Mountain
View, CA: RMC Research Corporation, 1978. In R. T. Johnson & W. P.
Thomas, User experiences in implementing the RMC Title I evaluation
models. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, 1979.

Fish, Q. W. An analysis of the evaluation data when ESEA Title I evaluation
models A1 and £2 are empirically field tested simultaneously. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco, California, April 8-12, 1979.

Fishbein, R. L. The use of non-normed tests in the ESEA Title I evaluation
and reporting system: Some technical and policy issues. Paper presented
at the annual meeting of the American Educational Research Association,
Toronto, Ontario, Canada, March 27-31, 1978.

Frisbie, D. A. Multiple choice versus true-false: A comparison of
reliabilities and concurrent validities. Journal of Educational
Measurement, 1973, 10, 297-304.

Gabriel, B. R., Stertner, A. J., & Troy, J. B. An empirical examination of
three models for estimating the effects of no treatment. Paper presented
at the American Educational Research Association Convention, New York
City, April 6, 1977.




Gerow, J. R. Performance on achievement tests as a function of the order of
item difficulty. Teaching of Psychology, 1980, 7, 93-94.

Glass, G. Memo to: unknown, interested parties. In Report of the committee
to examine issues related to the use of the norm referenced model for
Title I evaluation. Compiled by J. B. Hansen. Portland, Oregon:
Northwest Regional Educational Laboratory, October, 1978.

Glass, G. V. Integrating findings: The meta-analysis of research. Review of
Research in Education, 1977, 5, 351-379.

Glass, G. V., & Hakstian, A. R. Measures of association in comparative
experiments: Their development and interpretation. American Educational
Research Journal, 1969, 6(3), 403-414.

Goldman, J., & Crane, L. R. Title I model B adjustment procedures: which to
use and when. Paper presented at the annual meeting of the American
Educational Research Association, Boston, April 7-11, 1980.

Grier, L. B. The number of alternatives for optimum test reliability.
Journal of Educational Measurement, 1975, 12, 109-112.

Halpin, G., & Halpin, G. Retention in an actual classroom setting as a
function of type and complexity of tests. (ERIC Document Reproduction
Service No. ED 183 622)

Hardy, R. A comparison of Model A and Model C: Results of first year
implementation in Florida. ETS, Evanston, Ill. Personal communication,
January, 1979. In S. Murray, J. Arter, & B. Faddis, Title I technical
issues as threats to internal validity of experimental and
quasi-experimental designs (Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979)
and G. Echternacht, Model C is feasible for ESEA Title I evaluation
(Paper presented at the annual meeting of the American Educational
Research Association, Boston, Mass., April 10, 1980).

Hays, W. L. Statistics for the social sciences. New York: Holt, Rinehart
and Winston, 1973.

Hiscox, S. B., & Owen, T. R. Behind the basic assumption of Model A. Paper
presented at the annual meeting of the American Educational Research
Association, Toronto, Ontario, Canada, March 27-31, 1978.

Holliday, W. G., & Partridge, L. A. Differential sequencing effects of test
items on children. Journal of Research in Science Teaching, 1979, 16,
407-411.



Horst, D. P., & Wood, C. T. Collecting achievement test data. Mountain View,
CA: RMC Research Corporation, 1978. In R. T. Johnson & W. P. Thomas,
User experiences in implementing the RMC Title I evaluation models.
Paper presented at the annual meeting of the American Educational
Research Association, San Francisco, 1979.

House, G. P. A comparison of Title I achievement results obtained under USOE
models A1, C1 and a mixed model. Paper presented at the annual meeting
of the American Educational Research Association, San Francisco,
California, April 8-12, 1979.






Johnson, D. D., Pittleman, S., Ackwenker, & Perry, J. On bias diagnosis
of reading difficulties - IV (Technical Report No. 464). Madison:
Wisconsin Research and Development Center for Individualized Schooling,
197°. (ERIC Document Reproduction Service No. ED 173 355)

Johnson, R. T., & Thomas, W. P. User experiences in implementing the RMC
Title I evaluation models. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, 1979.

The Joint Dissemination Review Panel. IDEABOOK. Washington, D.C.:
Superintendent of Documents, U.S. Government Printing Office, October,
1977.

Kaskowitz, D. H., & Norwood, C. R. A study of the norm referenced procedure
for evaluating project effectiveness as applied in the evaluation of
project information packages. Stanford Research Institute, Research
Memorandum URU-3556, Menlo Park, CA, January, 197_. In S. Murray, J.
Arter, & B. Faddis, Title I technical issues as threats to internal
validity of experimental and quasi-experimental designs: Annotated
bibliography. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, April, 1979.

Kenny, D. A. A quasi-experimental approach to assessing treatment effects in
the non-equivalent control group design. Psychological Bulletin, 1975,
82, 345-362. In J. Goldman & L. R. Crane, Title I Model B adjustment
procedures: which to use and when. Paper presented at the annual
meeting of the American Educational Research Association, Boston, April
7-11, 1980.

Kierscht, M. S., & Vietze, P. M. Test stimuli: Representational level with
middle class and head start children. Psychology in the Schools, 1975,
12, 309-312.

Kintsch, W. Recognition and free recall of organized lists. Journal of
Experimental Psychology, 1968, 78, 481-487.

Kumar, V. K., Rabinsky, L., & Pandley, T. N. Test mode, test instructions,
and retention. Contemporary Educational Psychology, 1979, 4, 211-218.

Lai-Min, P. S., & Coffman, W. E. A study of "I don't know" response in
multiple-choice tests. Iowa City, Iowa: Iowa Testing Programs,
University of Iowa, 1974. (ERIC Document Reproduction Service No. ED
141 371)

Linn, R. L. The validity of inferences based on the proposed Title I
evaluation models. Paper presented as part of a symposium at the annual
meeting of the American Educational Research Association, Toronto,
Canada, March 27-31, 1978.

Linn, R. L., & Werts, C. E. Analysis of implications of the choice of a
structural model in the non-equivalent control group design.
Psychological Bulletin, 1977, 84, 229-234.






Loftus, G. R. Comparison of recognition and recall in a continuous memory
task. Journal of Experimental Psychology, 1971, 91, 220-226.

Long, J., Horwitz, S., & DeVito, P. An empirical investigation of the ESEA
Title I evaluation system's proposed variance estimation procedures for
use with criterion-referenced tests. Paper presented at the annual
meeting of the American Educational Research Association, Toronto, March,
1978.

Long, J. V., Schaffran, J. A., & Kellogg, T. M. Effects of out-of-level
survey testing on reading achievement scores of Title I, ESEA students.
Journal of Educational Measurement, 1977, 14, 203-213.

Mayeske, G., & Beaton, A., Jr. Special studies of our nation's students.
Washington, D.C.: Government Printing Office, 1975. In S. Murray, J.
Arter, & B. Faddis, Title I technical issues as threats to internal
validity of experimental and quasi-experimental designs. Paper presented
at the annual meeting of the American Educational Research Association,
San Francisco, April, 1979.

Millman, J., & Setijadi. A comparison of American and Indonesian students on
three types of test items. Journal of Educational Research, 1966, 59,
273-275.

Mishra, S. P. Influence of the examiner's ethnic attitudes on intelligence
test scores. Psychology in the Schools, 1980, 17, 117-122.

Murray, S., Arter, J., & Faddis, B. Title I technical issues as threats to
internal validity of experimental and quasi-experimental designs:
Annotated bibliography. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979.

Noggle, N. L. Alternative norms for model A1. NWREL/TAC, 1977. Draft. In
S. Murray, J. Arter, & B. Faddis, Title I technical issues as threats to
internal validity of experimental and quasi-experimental designs:
Annotated bibliography. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979.

0*enne, A. The effect of vertical scaling imprecision in the estimation of
Title I project gains. Paper presented at the annual meeting of the
American Educational Research Association, Toronto, 1978. In S. Murray,
J. Arter, & B. Faddis, Title I technical issues as threats to internal
validity of experimental and quasi-experimental designs. Paper presented
at the annual meeting of the American Educational Research Association,
San Francisco, April, 1979.

Piersel, W. C., Bredy, G. M., & Kratochwill, T. R. Further examination of
motivational influences on disadvantaged minority group children's
intelligence performance. Child Development, 1977, 48, 1142-1145.

Poage, M., & Poage, E. G. Is one picture worth a thousand words? Arithmetic
Teacher, 1977, 24, 408-414.




Porter, A. C., Schmidt, W. H., Floden, R. E., & Freeman, D. J. Practical
significance in program evaluation. American Educational Research
Journal, 1978, 15, 529-539. In S. Murray, J. Arter, & B. Faddis, Title I
technical issues as threats to internal validity of experimental and
quasi-experimental designs. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco, April, 1979.

Powell, G., Schmidt, J., & Raffeld, P. The equipercentile assumption as a
pseudo-control group estimate of gain. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco,
Calif., April 8-12, 1979.

Powers, S., & Gallas, E. J. Implications of out-of-level testing for ESEA
Title I students. Paper presented at the annual meeting of the American
Educational Research Association, Toronto, Ontario, Canada, March 27-31,
1978.



Report of the committee to examine issues related to the use of the norm
referenced model for Title I evaluation. Compiled by J. B. Hansen.
Portland, Oregon: Northwest Regional Educational Laboratory, October,
1978.

Roberts, S. J. Test floor and ceiling effects. Mountain View, CA: RMC
Research Corporation, 1978.

Roid, G., & Haladyna, T. A review of item writing methods for criterion-
referenced tests in the cognitive domain. Oklahoma City, Oklahoma:
Paper presented at the Annual Meeting of the Military Testing
Association, 1978. (ERIC Document Reproduction Service No. ED 178 562)

Slaughter, H. B., & Gallas, E. J. Will out-of-level norm-referenced testing
improve the selection of program participants and the diagnosis of
reading comprehension in ESEA Title I programs? Paper presented at the
annual meeting of the American Educational Research Association, Toronto,
Ontario, Canada, March 27-31, 1978.

Stonehill, R. M., & English, J. J. Measurement concerns in Title I
evaluation. Paper presented at the Large School Systems' Invitational
Conference on Measurement and Evaluation held in Alexandria, Virginia,
May 2, 1979.

Storlie, T. R., Rice, W., Harvey, P., & Crane, L. R. An empirical comparison
of Title I NCE gains estimated with model A1 and with model A2. Paper
presented at the annual meeting of the American Educational Research
Association, San Francisco, April, 1979.

Tallmadge, G. K., & Horst, D. P. The use of different achievement tests in
the ESEA Title I evaluation system. Paper presented at the annual
meeting of the American Educational Research Association, Toronto,
Ontario, Canada, March 27-31, 1978.

Tallmadge, G. K., & Horst, D. P. Using the data from state and local ESEA
Title I reports. Paper presented at the annual meeting of the American
Educational Research Association, New York City, April 4-8, 1977.



Tallmadge, G. K., & Roberts, A. O. H. Factors that influence test results.
Mt. View, CA: RMC Research Corporation, 1978. In R. T. Johnson & W. P.
Thomas, User experiences in implementing the RMC Title I evaluation
models. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco, 1979.

Tallmadge, G. K., & Wood, C. T. Comparability of gains from the three models
in the Title I evaluation system. Mountain View, CA: RMC Research
Corporation, 1980.

Tallmadge, G. K., & Wood, C. T. User's guide: ESEA Title I evaluation and
reporting system. Prepared for U.S. Dept. of Health, Education, and
Welfare. Mountain View, CA: RMC Research Corporation, October, 1976.

Taylor, C. Personal communication. February 20, 1981. 


Towle, N. J., & Merrill, P. F. Effects of anxiety type and item difficulty
sequencing on mathematics test performance. Journal of Educational
Measurement, 1975, 12, 241-249.

Van Hove, E., Coleman, J. S., Rabben, K., & Karweit, N. Schools' performance:
New York, Los Angeles, Chicago, Philadelphia, Detroit, Baltimore.
Unpublished manuscript, Baltimore, Md., October, 1970. Reported in R. L.
Linn, The validity of inferences based on the proposed Title I evaluation
models. Paper presented as part of a symposium at the annual meeting of
the American Educational Research Association, Toronto, Canada, March
27-31, 1978.

Weber, M. B. The effect of choice format on internal consistency. Emory
University. (ERIC Document Reproduction Service No. ED 161 940)

Wetten, W. Relative effectiveness of single and double multiple-choice
questions in educational measurement. New York, N.Y.: Paper presented
at the Annual Meeting of the American Psychological Association, 1979.
(ERIC Document Reproduction Service No. ED 185 097)

Williams, R. L., Davis, W., Anderson, >., & Favor, K. Test format as a form
of bias for black students. Journal of Non-White Concerns in Personnel
and Guidance, 1978, 6, 141-14T.

Yap, K. O., Estes, G. D., & Hansen, J. B. Effects of data analysis methods
and selection procedures in regression models. Paper presented at the
annual meeting of the American Educational Research Association, San
Francisco, 1979.

Yap, K. Y. Can selection tests be used as pretests? Paper presented as part
of a symposium at the annual meeting of the American Educational Research
Association, Toronto, Canada, March 27-31, 1978.



Appendix 1

Letter to District Title I Directors
Explaining Purpose of Project





UTAH STATE UNIVERSITY • LOGAN, UTAH 84322

UNIVERSITY AFFILIATED
EXCEPTIONAL CHILD CENTER
UMC 68

March 24, 1980






Dear 



Recently you received the attached letter from Kent
Worthington explaining that the Utah State Office of Education
had received a contract from the United States Office
of Education to investigate the evaluation models required
by the new Title I Evaluation and Reporting System (TIERS).
Staff from the Psychology Department at Utah State University
have been asked to assist the State Office of Education
in collecting data regarding the effectiveness and
applicability of Model A. In particular we will be studying
Model A's underlying assumptions about testing levels,
dates and procedures, and whether they are relevant to the
real needs of the school situation.

Hopefully, the investigation of these assumptions
will help LEAs and SEAs to:

a) make better informed decisions about the
selection of a local evaluation model for
Title I programs;

b) better interpret the results from evaluations
using Model A; and

c) avoid the violation of Title I assumptions.

To help achieve these objectives we would like to interview
LEA personnel in thirteen Utah School Districts
about the procedures of test administration, the rationale
for test selection, and any type of problems they may have
had in implementing the models. We would also like to observe
some of the Title I testing in each district to
verify and expand the data collected during the interviews.






801-750-1981 



The interview and observation data would be collected
by trained Utah State University graduate students during
the time you normally administer the standardized test in
conjunction with Title I. The interviewing and observation
procedures have been designed to be as unobtrusive as possible.
During the coming week we will contact you by phone
to answer any questions you have about the project and explain
in more detail what we would like to accomplish during
the visit. At that time we would also appreciate your
assistance in identifying which school or schools would be
best to visit.

Although participation in the project is voluntary,
your cooperation will greatly enhance the effectiveness of
the study. Should you have questions or concerns you would
like to discuss before we are able to talk with you on the
phone, feel free to contact Kent Worthington (801 533-6092),
or myself (toll free 800 622-5420). I look forward to talking
with you more in the near future.



Sincerely,



Karl R. White, PhD 
Director, Planning and Evaluation,
Exceptional Child Center and
Assistant Professor of Psychology




250 EAST 500 SOUTH STREET • SALT LAKE CITY, UTAH 84111 • TELEPHONE (801) 533-1431

UTAH STATE OFFICE
OF EDUCATION

WALTER D. TALBOT
STATE SUPERINTENDENT OF PUBLIC INSTRUCTION



March 18, 1980 




Dear

As you know, all states are now required to use an Evaluation Model from the
Title I Evaluation and Reporting System (TIERS) to evaluate the impact of Title I
projects within the state. Your district is one of 13 districts within the state
which will be reporting Title I evaluation data to the Feds during 1979-80 as a
part of the state's three year evaluation plan.

Most districts within Utah have chosen to use Model A to conduct their Title
I evaluation. The results obtained from most school districts have been positive
and demonstrate student gains beyond achievement from the regular instructional
program. We have observed considerable variance in student gains among grades,
schools, and districts. Also, the amount of gain varies among different evaluation
models, e.g., A, B, and C, and pre-post testing time intervals, e.g., fall to
spring, and spring to spring. In order to determine the reasons for the variances
observed and to increase confidence in achievement data, evaluation workshops and
individual consultation sessions have been conducted in most districts. State
Office and Northwest REL-TAC personnel have invested considerable time in these
activities in recent months.

The U.S. Education Department has been concerned. State refinement contracts
have been authorized to study such matters in greater depth. Last summer a
contract was granted to the Utah State Office of Education; it is being ac<
by Karl White and other evaluators at Utah State University. They are stuj
the effects of Model B in depth in one district and will be doing field wol
selected districts.

As one part of this project, we would like to collect additional data about
the implementation of Model A in each of the districts which will be reporting
evaluation results this year. Personnel from Utah State University will be
assisting us in this component of the project and will be contacting your
district shortly to coordinate times and procedures for this data collection (in
some cases, preliminary contact has already been made). Data collection in each
district will consist of a limited amount of observational data collected during




AVAfcD A. HOW, i - 
OtviMn «Lfcog»om AdmmMrMien 
i (Mt| 533-Sttt, 







Adult f dwcot»on and Community St'vctt 
Pt»»m*no*»or> ond Cdvcol»onot Dtvtlopm«nt 
Ttbphon* (801) 533 5061 
Comptntototy ond 8»ltf»guol EA/cotio* 
(801) 533-6092 
Sotcei Idvcation (801) 533 5982 






Maurine McDonald
Page 2
March 18, 1980

.Title I Spring Testing and. brief Intends ^gt^fSlS Eruption ' 
district staff, we have t n to pla tnesewt vntm » ^ ^ . 

S SET nsTlt "^technical adequacy of IMtl I* « «i St 

di^c?i^re 1 fr4,r«r u K:r ssaru . 

It is, of course, voluntary, but participation will contribute greatly to
the success of the project.

'~ I, you have .additional questions yd u can 

S H te Off 1* of Education or M r Kar Wh e. Dir. tor^ Pl.^ J.^.y , . . 
Exceptional .Child Center, Utah jtate university . ^ $8 , 

' inscribes K'Sr&XTn 2w2l8l. * 1«"< to worMng with you 

in this important project.- ' 

Sincerely yours,



Kent t. Worthington, Coordinator
Title I, ESEA



Jay K. Donaldson, Specialist
Title I, ESEA



dd 






Appendix 2

Letter to Principals of Schools
Visited During Project






UTAH STATE UNIVERSITY • LOGAN, UTAH 84322

801-750-1981

EXCEPTIONAL CHILD CENTER
UMC 68

March 26, 1980



Dear

The Utah State Office of Education has received a contract
from the United States Office of Education to conduct a study of
the evaluation models which the federal government now requires
all states to use in the Title I Evaluation and Reporting System
(TIERS). This project is under the direction of the State Title
I Director, Dr. Kent Worthington, who is being assisted by selected
staff from the Psychology Department at Utah State University.

Recently, we corresponded with your district superintendent
and spoke with your district Title I director and they agreed to
have your district participate in the project. As noted in the
attached letter from Dr. Worthington, the basic purpose of the
project is to evaluate the effectiveness and applicability of
Title I evaluation Model A for collecting data about the state's
Title I programs. Data from the project will not be used to make
funding decisions or statements of worth about an individual
district's Title I program.

As a part of the project we would like to visit your school
during the time you are conducting the post testing for your Title
I program. During this time we want to observe students' reactions
to the testing and interview school faculty about their perceptions
of the strengths and weaknesses of the Model A evaluation. These
data would be collected by trained graduate students from Utah State
University. Data collection procedures have been designed to be as
unobtrusive as possible and will require very little direct time
from any individual member of your staff.

It is our understanding that you will be doing your post testing
during the week of __________. Shortly after you receive this
letter, we will contact you by phone to answer any questions you
have and, if you agree to participate, work out the details of our
visit.






Should you have questions before we contact you, please
feel free to call Kent Worthington (801 533-6092) or myself
(toll free 800 662-5420). I look forward to talking with
you more in the near future.

Sincerely,



Karl White
Director
Planning & Evaluation
Exceptional Child Center and
Assistant Professor of Psychology




Appendix 3 

Memorandum Provided to Principals for 
Informing Teachers About the Project 






UTAH STATE UNIVERSITY 
University Affiliated
Exceptional Child Center

MEMORANDUM

To:

From:

Subject: USOE Study Concerning Title I Evaluation Models

Date: 



The Utah State Office of Education has received a contract from
the United States Office of Education to investigate the effectiveness
and applicability of the Title I Evaluation and Reporting System
(TIERS) which the Federal government now requires all participating
Title I programs to use in evaluating their projects. As a part
of the study, project staff will be visiting our school on __________.
During their visit they will be observing students who are taking
tests and talking briefly with some of us about the testing procedures
and our reactions to the present system of evaluating our Title
I program.

Observation of the testing should not disrupt your normal operation
at all, but I wanted you to be aware of the study so you would
not be surprised by the presence of an unfamiliar person. At 8:00
on the day of their visit, project staff will be available in room
__________ for anyone who has questions or would like additional information
about the project. In addition I will be contacting some of
you to arrange for a time (approximately 15 minutes) that you could
visit with one of the project staff about some of your perceptions
of the currently used Title I evaluation system. Should you have any
questions, please feel free to contact me.






Appendix 4 



Interview Guide Sheet for Collecting Data From
LEA Personnel Regarding Implementation
of TIERS







INTERVIEW: TITLE I PERSONNEL

DISTRICT
SCHOOL
DATE
POSITION
GRADE
INTERVIEWER



1. Students' reaction to testing:

a) How do students feel about the testing (positive,
negative, apathetic)?

b) Do they understand the purpose of the tests?

c) How do the students usually behave during testing?

d) Do they try to do their best on the tests?

2. District personnel reaction to testing:

a) Do you think the testing is worthwhile, worth the time
and effort it takes?

b) Do you use pre/post subtest for purposes other than to
compare gains? Specifically how?

c) Does anyone individually discuss with students/parents
the results of the Title I testing?





3. Selection of students:

a) What is the selection process? Percent of out of level?

b) Does the selection process work? Do you think the
correct students are being selected?

c) Separation of pre and selection test for all students.

d) How are new move-ins selected? What percentage of total
students are new move-ins?




4. Test administration:

a) Do administration dates match empirical norm dates (pre and
post)? (When did you administer your pre test?) If not,
do districts do any extrapolation?

b) What types of things are done to prepare students for
testing? Teacher preparation?

c) How did you select particular test and form/level used in
selection, pre and post? Percent of out of level?

d) When and how are make-ups done? Estimate percentage of students
who miss original testing and 1) take make-up and 2) never
get make-up.

e) Who is responsible for turning data from Title I testing in to
USOE?

1) Is reporting format any good (strengths and weaknesses)?

2) What checks are made to assure accuracy?






Appendix 5 

Data Collection Form and Definitions of
On-Task/Off-Task Behavior for
Classroom Observation






OBSERVATION: TITLE I TESTING

[Recording grid: one column per observed STUDENT, with numbered rows (1-17) for successive observation intervals. Each column is totaled at the bottom as TOTAL ONTASK and % ONTASK.]

NOTES:



Directions: Record 4 intervals on one student before
observing next student. Observe 5 students and one
teacher for a total of 24 intervals before repeating
sequence.
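The rotation in these directions can be sketched in a few lines of Python. This is only an illustration: the function names, the student labels, the placement of the teacher after the five students, and the back-to-back 5-second spacing (3 seconds of observation plus 2 seconds of recording, as listed on the form) are assumptions, not part of the original form.

```python
from itertools import cycle, islice

OBSERVE_SECONDS = 3        # from the form: 3 second observation
RECORD_SECONDS = 2         # from the form: 2 second record
INTERVALS_PER_TARGET = 4   # from the directions: 4 intervals per target


def observation_cycle(students, teacher="teacher"):
    """One pass of the rotation: 4 intervals on each student, then 4 on the teacher.

    Putting the teacher last is an assumption; the directions do not give the order.
    """
    targets = list(students) + [teacher]
    return [t for t in targets for _ in range(INTERVALS_PER_TARGET)]


def schedule(students, n_intervals, teacher="teacher"):
    """Yield (start_second, target) for the first n_intervals, repeating the cycle.

    Assumes intervals run back to back with no gap between them.
    """
    one_pass = observation_cycle(students, teacher)
    step = OBSERVE_SECONDS + RECORD_SECONDS
    for i, target in enumerate(islice(cycle(one_pass), n_intervals)):
        yield i * step, target


students = ["S1", "S2", "S3", "S4", "S5"]   # hypothetical labels
print(len(observation_cycle(students)))      # intervals in one full pass
print(list(schedule(students, 6)))           # first six intervals with start times
```

With five students and one teacher, one pass contains 6 x 4 = 24 intervals, which matches the total given in the directions.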



Intervals: 3 second observation, 2 second record

CODE:
1 = On task (for entire interval)
Off task (for any part of interval)
Teacher contact (verbal or physical interaction with student alone)
No record made (explain in NOTES section)
Student is finished with test
Beginning of timed test
End of timed test

TOTAL ONTASK = Number of "1" recorded for one student
% ONTASK = Number of "1" ÷ 32
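The two summary scores on the form can be computed as in the following Python sketch. It assumes that the garbled operator in the scanned "% ONTASK" line is a division by 32 and that the result is expressed as a percentage. The code "1" for on task comes from the form; the codes "0" and "T" used in the example for off task and teacher contact are hypothetical, since the original symbols did not survive the scan.

```python
ON_TASK = "1"            # the form counts the number of "1" codes
DEFAULT_INTERVALS = 32   # divisor as read from the form (assumed to be a division)


def total_on_task(codes):
    """TOTAL ONTASK: number of intervals coded "1" for one student."""
    return sum(1 for code in codes if code == ON_TASK)


def percent_on_task(codes, n_intervals=DEFAULT_INTERVALS):
    """% ONTASK: on-task count divided by the interval total, times 100 (assumed)."""
    return 100.0 * total_on_task(codes) / n_intervals


# Hypothetical record for one student: 24 on task, 6 off task, 2 teacher contact.
record = ["1"] * 24 + ["0"] * 6 + ["T"] * 2
print(total_on_task(record))     # 24
print(percent_on_task(record))   # 75.0
```

Keeping the divisor as a parameter lets the same function serve if a student was observed for a different number of intervals.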






STUDENT ON-TASK BEHAVIOR DURING TEST TAKING

TEACHER DIRECTED

Raising hand
  Example: while looking at the teacher
    - during direction giving
    - during test taking
  Nonexample: while looking at
    - another student
    - the test

Asking questions
  Example: concerning
    - directions
  Nonexample: concerning
    - answer to questions
    - drinking water
    - using bathroom
    - broken pencil lead
    - eraser

Looking
  Example: at
    - teacher
    - test paper when so directed
    - board when so directed
    - fingers if counting
  Nonexample: at
    - another student
    - another's test
    - observers
    - desk (inside, outside)
    - toys
    - fingers if not counting
    - wrong test page
    - clothes

Talking (audible)
  Example: to
    - teacher when called upon
  Nonexample: to
    - another student
    - self (if words audible)
    - teacher when not called upon

Body movement
  Example:
    - picking up pencil from floor
    - scooting chair or desk less than 10 inches
    - scratching body
    - writing answers to test questions when directed
    - playing with clothes
  Nonexample:
    - writing on desk
    - standing up (body leaves seat)
    - hands on another's desk
    - hands on another student or teacher
    - writing answers to questions when not directed
    - throwing anything
    - kicking another's chair or desk
    - leaning back on chair
    - tapping pencil

STUDENT DIRECTED

Raising hand
  Example: while looking at the teacher
    - during test taking
  Nonexample: while looking at
    - another student
    - the test

Asking questions
  Example: concerning
    - directions
  Nonexample: concerning
    - answers to questions
    - drinking water
    - using bathroom
    - broken pencil lead
    - eraser

Looking
  Example: at
    - test paper
    - teacher when hand raised
    - prepared material for early finishers
  Nonexample: at
    - another student
    - another's test
    - observers
    - teacher unless hand raised
    - toys
    - fingers if not counting
    - desk (inside, outside)
    - wrong test page
    - clothes

Talking (audible)
  Example: to
    - teacher when called upon
  Nonexample: to
    - another student
    - self (if words audible)
    - teacher when not called upon

Body movement
  Example:
    - picking up pencil from floor
    - scooting chair or desk less than 10 inches
    - scratching body
    - writing answers to test questions when directed
    - playing with clothes
  Nonexample:
    - writing on desk
    - standing up
    - hands on another's desk
    - hands on another student or teacher
    - writing answers to questions when not directed
    - throwing anything
    - kicking another's chair or desk
    - leaning back on chair

Definitions

Teacher Directed:
  Teacher gives directions
  Teacher reads the question for each item.



leaning back on 
tapping 1 Pencil 



Student Directed 

Students work* at their own pace throughout test 
Jest Is usually timed 
fteglns when teacher says "Ready? Go!" 
Ends when teacher says "Stop! Close your booklet- 






TEACHER BEHAVIOR
DEFINITION OF ON-TASK BEHAVIOR

Talking
  Example:
    To class or individual student:
      - explain the directions or answer format
      - answer questions about directions
      - give directions
      - read questions
      - praise listening or working
    To aide:
      - but only to alert to nonattending student
      - if students are on incorrect item or have a broken pencil lead
  Nonexample:
    To class or individual student:
      - if students are on incorrect item or have a broken pencil lead
      - to explain answer
      - to help formulate an answer
      - to threaten, criticize, or reprimand
      - to repeat questions that are to be given only once
    To class during TT
    To aide:
      - except to alert
    To another teacher
    To communication system

Moving
  Example:
    Standing in front of room
    Pointing to nonattenders
    Providing a pencil
  Nonexample:
    Standing with back to any student or where faces cannot be seen
    Sitting
    Lying down

Looking
  Example:
    At individuals in the class:
      - after reading each sentence in the directions
      - after reading each question
    At clock
    At aide to alert to nonattending student
  Nonexample:
    At:
      - another teacher
      - textbook
      - lesson plans
      - classroom equipment
      - magazine/book
      - manual (only during TT)



SCHOOL ________  GRADE ________

DISTRICT ________  DATE ________

SESSION ________  OBSERVER ________

DID THE TEACHER DO THESE BEFORE
ADMINISTERING THE TEST?

Class Environment

1. Arrange the students' desks so they are not touching.

2. Position the desks to face the same direction (every booklet and student's face can be seen from the front of the room).

3. Assure that the room is comfortable (temperature, light, noise).

4. Post a "Testing, Do Not Disturb" sign on door.

5. Have a visible supply of pencils.

6. Have a visible clock or watch with a minute hand.

7. Create a generally positive climate that promotes good work habits and is without pressure or tension.

8. Seat the most frequently nonattending students in the front.



Student Preparation

1. Provide an opportunity for using the bathroom, drinking water, and sharpening pencil.

2. Provide all students with pencil and an eraser.

3. Ask students to remove nontesting material from desks if appropriate.

4. Explain the reason for the test (to use the information to help teach students).

5. Obtain the attention of the entire class for 1 minute prior to directions (all students watching teacher).

6. Pass out test booklets in less than 2 minutes and in an orderly and efficient manner.

7. Verbally reward attentive behavior.



Reminders

1. Not to leave their seats but to raise a hand if something is needed.

2. What to do if they finish before time is up (TT only).

3. To check their work if they finish before the time is up (to see if every question is answered only once).

4. That some of the items will be more difficult than their daily work.

5. To skip an item that they don't know and go on to the next one.



Positive Atmosphere

1. Praise individual students for appropriate behavior.

2. Praise class for listening and working.

3. Smile frequently.

4. Make less than two reprimands, threats, or criticisms during the subtest.

5. Speak with a gentle, but firm voice.

6. Use physical touch to prompt and reward on task behavior.

7. Start the test directions within several minutes of sitting down so that students do not become restless with preparation activities.

8. Quickly supply a student with pencil or eraser when needed.

9. Stand near front of room where all students can see easily.



Reading Directions

1. Look at class between sentences.

2. Survey the class to check if directions were followed (i.e. "Put your finger on the sample," "fill in the circle," "write your name," "turn to page 12").

3. Alert the aide to nonattenders.

4. Proceed to next direction only after all students are ready.

5. Supplement printed directions with verbal and visual explanations when students do not understand the procedure.

6. Change wording of directions to a vocabulary the students are familiar with (i.e. "circle" instead of "oval" or "box" instead of "frame").



Reading Test Items

1. Look up after each question and glance around room.

2. Follow the exact wording of questions as stated in the manual. (Never define or explain words or illustrate procedures.)

3. Allow approximately 10 seconds between items.

4. Never repeat a question unless the directions specify to do so.

5. Alert aide to nonattenders or to students with raised hands.



Timed Tests

1. Set clock for correct time requirement.

2. Watch students during entire test to detect speeding, slow answering, day dreaming and cheating.

3. Alert aide to nonattenders or to students with raised hands.



End of Test

1. Praise students for working hard.

2. Collect booklets in directed manner.

3. Provide a directed, stand-up, rest period.






Appendix 7 

Interview Guide Sheet for Teachers to
Prioritize Curriculum Areas

TEACHER RATING OF ITEMS
FROM STANDARDIZED TESTS

DISTRICT ________  POSITION ________

SCHOOL ________  GRADE ________

DATE ________  INTERVIEWER ________

Columns: Rank by Importance; % of Teaching Time; Category; Rank by Importance

A. Phonics

1. Consonant sounds (blends)
2. Vowel sounds (long and short)
3. Consonant digraphs (ch, wh, sh, th)
4. Vowel digraphs/diphthongs (ai, ay, ea, ee, ou, ow, etc.)
5. Controlled vowels (al, or, aw,
6. Variant vowels (said, was, etc.)
7. Other

B. Vocabulary

1. Word meaning
2. Contextual clues
3. Analogies
4. Sight Vocabulary
5. Other

C. Literal Comprehension

D. Inferential Comprehension

E. Structural Analysis

1. Root words
2. Syllabication
3. Affixes
4. Compound words
5. Contractions
6. Other

F. Other:



HOW TO FIGURE STUDENT/TEACHER RATIO
(With some good luck)

You may have found, as we have, that determining the student/teacher ratio of a Title I program can be a trying and frustrating experience. The current instructions are vague and do not address the special needs of unique treatment activities. It is, of course, impossible to develop instructions that will account for every contingency. These instructions were developed to better clarify the student/teacher ratio computation procedure for a greater variety of programs. Unfortunately, the broader applicability has necessitated a greater complexity. We hope it is comprehensible.

The computation procedure will be explained by a series of directions and illustrated by hypothetical examples. These examples have been arranged in the appendix in a grid pattern. Each direction will have a letter and/or number associated with it that refers to a particular space in the grid. The grid is a suggested format for compiling ten types of data. How to do so and what to do with it will be explained.

We have identified six basic teaching modes: a small group with more than one grade level (row A), a small group that also serves non-target students (row B), a large group (row C), individualized instruction (rows D and F), and peer group tutoring (row E).






COLUMN 1 - MODE OF ACTIVITY

The mode of activity is the type of Title I treatment situation in which the Title I target children of a particular grade participate. This could be a small group activity. If there is more than one small group, each group must be entered individually (1A and 1B = SG). Other modes could be large groups (1C = LG) or individualized instruction (1D and 1F = II). Again, individualized instruction is entered twice, but in this instance because there are two different instructors (10D and 10F = Aide and P.O.).

The most common variant encountered is supervised peer tutoring. This variant should be entered as a small group with an explanation (1E = SG). Peer tutors are not paid Title I employees and cannot therefore be considered instructors. Unsupervised peer tutoring should not be entered at all.


COLUMN 2 - TARGET GRADE LEVEL STUDENTS

Column 2 refers to the number of students from a particular grade that are involved in each separate activity listed in column 1. Some children are probably involved in more than one Title I treatment activity. For instance, one child may be in a small group at one time, a large group at another, and individually tutored at still another time. This child should be entered three times, once for each activity. Because students may be counted more than once, the total number of students involved in the various activities in a certain grade level will generally far outnumber the total number of target students for that grade.

A problem arises in those programs in which more than one grade level is involved in the same Title I treatment activity. Enter only the number of students for the grade of immediate concern in column 2. Row A shows a situation in which there is a small group of seven students of mixed grades (A6 = 7), only three of whom are in the second grade (A2 = 3).






A related problem arises when non-target children are sometimes included in Title I treatment groups. This case occurs in row B. In this case, enter only the number of target children being served in column 2 (2B = 8).

COLUMN 3 - LENGTH OF ACTIVITY

Enter the length of time an individual student would be served during one treatment session. A student would be served for thirty minutes in a small group (3A, 3B and 3E = 30), fifty minutes in a large group (3C = 50) and fifteen minutes in an individual tutoring session (3D and 3F = 15).

COLUMN 4 - FREQUENCY OF ACTIVITY

Enter the approximate number of times per week that each separate activity is served. Take into consideration, if possible, holidays, assemblies, half days and other relevant events that may lessen the frequency. For the individual instruction sessions only, also account for the average absence rate of the students.

COLUMN 5 - TARGET STUDENT TREATMENT TIME

Multiply columns 2, 3 and 4 together for each individual row.
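As a rough sketch of this step, column 5 can be treated as the product of the number of target students (column 2), the session length in minutes (column 3) and the weekly frequency (column 4). The function name is hypothetical, and the frequencies of 5 and 4 used below are assumptions chosen so that the results match the 450 and 4200 figures shown in the example grid.

```python
def target_treatment_time(target_students: int,
                          minutes_per_session: float,
                          sessions_per_week: float) -> float:
    """Column 5: target students x session length x weekly frequency."""
    return target_students * minutes_per_session * sessions_per_week


# Row A: 3 target students, 30-minute sessions, assumed 5 sessions per week.
row_a = target_treatment_time(3, 30, 5)
# Row C: 21 target students, 50-minute sessions, assumed 4 sessions per week.
row_c = target_treatment_time(21, 50, 4)
print(row_a, row_c)
```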

COLUMN 6 - TOTAL NUMBER OF STUDENTS

Column 6 will generally be the same as column 2, except in those cases where non-Title I children are being served with the Title I target students and in cases where more than one grade level is being served in the same session. In these cases enter the total number of students being served in each separate activity, regardless of the students' level or Title I status.






COLUMN 7 - NUMBER OF INSTRUCTORS

Enter the number of instructors involved with each activity. If there is more than one instructor in a given treatment session, such as a pull out teacher and aide (7B = 2), enter the total number of instructors.

COLUMN 8 - INDIVIDUAL TREATMENT S/T RATIOS

Divide the number of students (column 6) by the number of instructors (column 7) for each separate treatment. The large group in row C, as an example, has 21 students (C6 = 21) and one instructor (C7 = 1), thus 21 divided by 1 = 21 (C8 = 21). The student/teacher ratio for all individual instruction sessions is automatically one (8D and 8F = 1) regardless of how many students are treated.
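The column 8 rule, including its exception for individual instruction, could be expressed as follows. This is a minimal sketch: the function name and the use of the mode code "II" as the trigger for the automatic ratio of one are illustrative assumptions, not part of the original instructions.

```python
def individual_st_ratio(total_students: int, instructors: int, mode: str) -> float:
    """Column 8: total students (col. 6) / instructors (col. 7).

    Individual instruction ("II") is always 1, whatever the counts are.
    """
    if mode == "II":
        return 1.0
    if instructors <= 0:
        raise ValueError("an activity needs at least one instructor")
    return total_students / instructors


print(individual_st_ratio(21, 1, "LG"))  # row C: large group, one instructor
print(individual_st_ratio(10, 2, "SG"))  # row B: ten students, teacher plus aide
print(individual_st_ratio(11, 1, "II"))  # individual instruction
```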



COLUMN 9 - ADJUSTED STUDENT TREATMENT TIMES

Divide the Target Student Treatment Time (column 5) by the Individual Treatment S/T Ratio (column 8) for each individual row and enter the figure in column 9. For instance, in row C, Target Student Treatment Time = 4200 minutes (C5 = 4200), and the Individual Treatment S/T Ratio = 21 (C8 = 21). Thus 4200 divided by 21 = 200 (C9 = 200).
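A short sketch of the column 9 division, using figures from the example grid: row C gives 4200 / 21 = 200, and row A gives 450 / 7, which rounds to the 64.3 shown in the grid. The function name is hypothetical, and the row A ratio of 7 is inferred from seven students with a single instructor.

```python
def adjusted_treatment_time(target_time: float, st_ratio: float) -> float:
    """Column 9: target student treatment time (col. 5) / S/T ratio (col. 8)."""
    if st_ratio <= 0:
        raise ValueError("S/T ratio must be positive")
    return target_time / st_ratio


row_c = adjusted_treatment_time(4200, 21)
row_a = round(adjusted_treatment_time(450, 7), 1)  # ratio of 7 is inferred
print(row_c, row_a)
```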



COLUMN 10 - TYPE OF INSTRUCTOR

Simply enter the title of the instructor or instructors involved in each treatment session.






COMPUTATION OF THE OVERALL STUDENT/TEACHER RATIO

To compute the overall student/teacher ratio for any grade follow these steps.

1. Add together all the figures in column 5 (total = 6945).

2. Add together all the figures in column 9 (total = 1311.3).

3. Divide the sum of column 5 by the sum of column 9 to obtain the student/teacher ratio = 5.29.
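The three steps can be checked with a brief sketch. The column 5 and column 9 lists below use the figures legible in the example grid for rows A through F; the function name is hypothetical. The quotient is approximately 5.296, which the instructions report as 5.29.

```python
from typing import Sequence


def overall_st_ratio(col5: Sequence[float], col9: Sequence[float]) -> float:
    """Sum of column 5 divided by sum of column 9."""
    total_adjusted = sum(col9)
    if total_adjusted <= 0:
        raise ValueError("column 9 must sum to a positive number")
    return sum(col5) / total_adjusted


# Rows A-F as read from the example grid.
col5 = [450, 1200, 4200, 240, 360, 495]   # target student treatment time
col9 = [64.3, 240, 200, 240, 72, 495]     # adjusted student treatment times
print(sum(col5), round(sum(col9), 1), overall_st_ratio(col5, col9))
```

Weighting by treatment time in this way means that long, frequent activities influence the overall ratio more than brief ones.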




Column headings:
 1 = Mode of Activity
 2 = Number of Students
 3 = Length of Activity
 4 = Frequency of Activity
 5 = Target Student Treatment Time
 6 = Total Number of Students
 7 = Number of Instructors
 8 = Ind. S/T Ratios
 9 = Adjusted Student Treatment Times
10 = Type of Instructor

Row  1                2    3    4    5      6    7    8    9       10
A    SG               3    30   ...  450    7    ...  ...  64.3    P.O.
B    SG               8    30   ...  1200   10   2    ...  240     P.O. + Aide
C    LG               21   50   ...  4200   21   1    21   200     Teacher Leader
D    II               ...  15   ...  240    ...  ...  1    240     Aide
E    SG (Peer Tutor)  ...  30   ...  360    ...  ...  ...  72      P.O.
F    II               ...  15   ...  495    ...  ...  1    495     P.O.

Total                                6945                  1311.3

Student teacher ratio = 5.29



Appendix 9 

Letters of Approval, Support, & Notification
Regarding Extended Work Scope Project

Letter of support from Title I Director



Salt Lake City School District 

440 East First South   Salt Lake City, Utah 84111   Phone 322-1471

October 11, 1979

Ms. Cie Taylor
Exceptional Child Center UMC 68
Utah State University
Logan, Utah 84322

Dear Cie,

I appreciated the opportunity to talk with you about the problems of group achievement testing with low achieving and learning disabled students. With you, I am concerned that because of motivational problems and administration procedures, these test results may not be indicative of the students' true achievement level. Your proposed study sounds like an excellent approach to beginning to provide answers in this important area.

As I explained to you, our District has a committee which must approve all outside research. Before we could give official approval for you to conduct the project in our District it would have to be cleared by this committee. However, because the outcomes of the project would be central to many of our District concerns, I do not anticipate any problems in obtaining this approval.

Good luck with your project! I look forward to hearing more about it from you.

Sincerely,

Darlene Call
Administrator for Educational Accountability




Salt Lake City Public School approval

Salt Lake City School District

440 East First South   Salt Lake City, Utah 84111   Phone: 322-1471

April 15, 1980

Mr. Joseph A. Gappa
Office of the Vice President for Research
UMC_ \k 

Utah State University 
Logan, Utah 84322

Dear Mr. Gappa:

District personnel in the Salt Lake City School District have reviewed the components of the research proposal entitled The Effect of Reinforcement and Training on Group Standardized Testing Behavior of Mildly Handicapped and High Risk Students.

We grant approval for the implementation of the research as proposed and endorse the efforts of project personnel to increase the validity of group administered standardized instruments.

I have read the Informed Consent Format and understand the elements therein. We feel that the rights and welfare of the subjects (second grade students) will not be violated under the research provisions. To insure parental approval, each parent will be provided with a letter explaining the study and will have an opportunity to withdraw their child from the observation, training, and reinforcement procedure.

Sincerely,

Stanley R. Morgan, Administrator
Research and Public Information

SRM:ab

cc: Cie Taylor



Request for Salt Lake City Schools approval

UTAH STATE UNIVERSITY   LOGAN, UTAH 84322

801-750-1981

UNIVERSITY AFFILIATED
EXCEPTIONAL CHILD CENTER
UMC 68

April 4, 1980

Dr. Stanley Morgan
440 E 1st S
Board of Education
Salt Lake City School District
Salt Lake City, UT 84111

Dear Dr. Morgan:

Thank you for reviewing and approving the research proposed in The Effect of Reinforcement and Training on Group Standardized Testing Behavior of Mildly Handicapped and High Risk Students. In response to our phone conversation on Monday, March 17, I have enclosed the documents necessary for your affirmation that the rights and welfare of children will not be violated by implementing the project. The following items are included:

1. A copy of the Informed Consent Format. This form is provided for your information and need not be filled in or returned. It is referred to in the letter to be sent to Mr. Gappa.

2. A draft of the letter which is to be sent to Utah State University to assure that the rights and welfare of the students have been protected.

3. A draft of the letter to be sent to each second grade parent (from the principal) to explain the research.

Please read the draft of item 2 above and feel free to edit it to suit your needs. The letter should be sent to Mr. Gappa with a copy to me.

Thank you, Dr. Morgan, for being so helpful. You were so pleasant on the phone, and I hope we can meet in the near future. Please call me collect (750-2044) if I can assist you in any way.

Regards,

Cie Taylor

CT:rrm
Enclosures



Notification to parents

April 4, 1980

Dear Parents:

The Salt Lake City Schools are working in cooperation with Utah State University this year on a project designed to investigate the validity of using group standardized tests to measure student academic achievement. The project will train children in test taking strategies and train their teachers in test administration practices. Both of these training programs can eliminate many testing problems by preparing the students and teacher for testing. The tasks of taking and giving tests may also become less difficult and less negative.

A total of 24 second grade classrooms in 12 schools located in Salt Lake City are included in this study. Your child's classroom is one of those participating in this project. Children who participate in the project will be observed during the regularly scheduled District-wide spring testing (April 28 - May 1, 1980). As normal, all test scores will be kept confidential and only group scores will be used to report data.

In order to compare the performance of those students who are trained with those who are not trained, only some students will receive the instruction in test taking. Should your child be chosen for training, one or two hours of instruction is being provided during school hours up to two weeks before the actual testing but will not otherwise interfere with your child's regular work.

In addition to providing some students with training in test taking, all students will have the opportunity to earn a monetary reward for doing well on the test. The average amount of the reward to be given to students will be $1.00. Each child will have an individual goal of a specific test score that is set before the test. If your child attains this individual goal, he or she will earn a reward on the day following the test. Depending on the group to which your child is assigned, these rewards may be earned for math or for reading gains.

Our preliminary results indicate that these training programs will benefit most elementary teachers and pupils in the Salt Lake City Schools. However, if you have any questions regarding this project or the training, please contact me for further information. If for some reason you would prefer that your child not participate in the study, you may notify the school office and your child will not be included in the training, observation, or reinforcement. Thank you for your cooperation in this project.

Sincerely,

Cie Taylor
Research Assistant



To principals informing them of letter to parents 



March 31, 1980 



Dear 

Our meeting on Thursday, March 27, was not only informative but delightful. I am pleased that we had the opportunity to chat for a brief time while reviewing the plans for the spring testing (SAT). I spoke with Maurine McDonald after seeing you and she will be mailing you a testing schedule this week.

In reference to the letters to be sent by you to parents explaining the testing program, you suggested that the school send copies home with the students. I had agreed to provide you with a draft of this letter for your editing. However, through an error in communication the draft was duplicated at the District office and delivered to you in bulk. If you haven't already received this package, it will probably arrive this week.

I apologize for this error. Feel free to change the letter to suit your style and make your own duplications. This letter should be sent home with all second grade students in the classrooms that were chosen to participate in the study. This information is included on the project outline I left with you during my visit.

Thank you for providing so much cooperation in our spring testing project. I will be contacting you shortly regarding the exact scheduling of project activities. Please call me collect (750-2044) if additional concerns arise.



Regards, 



Cie Taylor 
Research Assistant 



CT:dg



UTAH STATE UNIVERSITY   LOGAN, UTAH 84322

801-750-1981

UNIVERSITY AFFILIATED
EXCEPTIONAL CHILD CENTER
UMC 68

May 19, 1980

Dear Parent:

Utah State University and Ellis Elementary School are jointly cooperating in a project to produce a short filmed teaching sequence that is designed to improve the test taking skills of students. We have selected your child to participate as a student in the filming. The filming will take place on May 21, 1980 in the second grade classroom during the afternoon session.

Attached is a release form granting permission to film your child and use the videotape for educational purposes. Should you have further questions, please contact your child's teacher.

Larry Jacobsen
Principal, Ellis Elementary School



UTAH STATE UNIVERSITY   LOGAN, UTAH 84322

801-750-1981

UNIVERSITY AFFILIATED
EXCEPTIONAL CHILD CENTER
UMC 68

LOCATION:

TEACHER:

RELEASE TO USE VIDEOTAPES FOR EDUCATIONAL PURPOSES:

I hereby grant permission and authorize Utah State University to take, use and distribute videotapes of me, named below, for the purposes of producing educational information and instructional materials in a manner to be selected by the University. I understand that this includes the right to use and license the use of such videotapes for any educational purpose, including teacher and [illegible]. I agree that I will not institute or support any claim or suit of any nature against Utah State University or the persons to whom it might license use or distribution of such pictures.

Date

Legal Signature

Address

Phone



[Handwritten note]

March [illegible]

Dear Parents,

Thanks everyone for being cooperative about the videotape we are making. Some of you have requested an opportunity to see this film. We are tentatively scheduling a viewing for [illegible] P.M.

Please return bottom of this note.

Thank you again,

P.S. [illegible]

[ ] Yes, [illegible]

[ ] I want to come but have a conflict. A better time would be ________.

[ ] I don't care to come.



